4. How to Compute the Similarity Between Two
Text Documents
• Computing the similarity between two text documents
is a common task in NLP/IR, with several practical
applications. It is commonly used, for example, to rank
results in a search engine or to recommend similar
content to readers.
5. Text Similarity
• Our first step is to define what we mean by similarity.
We’ll do this by starting with two examples.
Let’s consider the sentences:
• The teacher gave his speech to an empty room
• There was almost nobody when the professor was
talking.
• Although they convey a very similar meaning, they are
written in completely different ways. In fact, the two
sentences have only one word in common (“the”), and
not a particularly significant one at that.
6. Document Vectors
• The traditional approach to computing text similarity
between documents is to transform the input documents
into real-valued vectors. The goal is to have a vector
space where similar documents are “close”, according to
a chosen similarity measure.
• This approach is known as the Vector Space Model,
and it’s very convenient because it allows us to use
simple linear algebra to compute similarities. We just
have to define two things:
• A way of transforming documents into vectors
• A similarity measure for vectors.
• So, let’s see the possible ways of transforming a text
document into a vector.
7. Document Vectors: an Example
Let’s consider three sentences:
• We went to the pizza place and you ate no pizza at all
• I ate pizza with you yesterday at home
• There’s no place like home
• To build our vectors, we’ll count the occurrences of each word
in each sentence, as in the sketch below:
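• A minimal sketch in plain Python (whitespace tokenization, no stemming or stop-word removal), counting each vocabulary word per sentence:

from collections import Counter

sentences = [
    "we went to the pizza place and you ate no pizza at all",
    "i ate pizza with you yesterday at home",
    "there's no place like home",
]

# Shared vocabulary: one vector dimension per distinct word.
tokenized = [s.split() for s in sentences]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each vector holds the count of every vocabulary word in one sentence.
vectors = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

for sentence, vector in zip(sentences, vectors):
    print(sentence, "->", vector)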
8. • Once we have our vectors, we can use
the standard similarity measure for this
situation: cosine similarity. Cosine
similarity measures the cosine of the angle between
two vectors and returns a real value between -1 and 1.
• If the vectors only have positive values, like in
our case, the output will actually lie between 0
and 1. It will return 0 when the two vectors are
orthogonal, that is, the documents share no
words, and 1 when the two vectors are
parallel, that is, the documents have identical
word proportions.
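• A small illustrative implementation using numpy (the helper name is our own):

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Orthogonal count vectors give 0; parallel ones give 1.
print(cosine_similarity([1, 0, 2], [0, 3, 0]))  # 0.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 (up to rounding)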
9. Word Embeddings
• Word embeddings are high-dimensional vectors
that represent words. We can create them in an
unsupervised way from a collection of
documents, generally using neural networks, by
analyzing all the contexts in which each word
occurs.
• The result is vectors that are similar (according
to cosine similarity) for words that appear in
similar contexts, and thus have a similar meaning.
• For example, since the words “teacher” and
“professor” can sometimes be
used interchangeably, their embeddings will be
close together.
10. • For this reason, using word embeddings
enables us to handle synonyms and words with
similar meanings in the computation of
similarity, which we couldn’t do using word
frequencies alone.
• However, word embeddings are just vector
representations of individual words, and there are
several ways to integrate them into our
text similarity computation. A basic example
follows.
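• One basic, illustrative approach (by no means the only one) is to average the pretrained embeddings of a document’s words into a single document vector, then compare documents with cosine similarity. The sketch below assumes the gensim library and its downloadable “glove-wiki-gigaword-50” vectors:

import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # downloads on first use

def document_vector(text):
    """Average the embeddings of the words the model knows."""
    words = [w for w in text.lower().split() if w in model]
    return np.mean([model[w] for w in words], axis=0)

v1 = document_vector("The teacher gave his speech to an empty room")
v2 = document_vector("There was almost nobody when the professor was talking")

similarity = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(similarity)  # well above 0, despite almost no shared words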
12. TF-IDF
• TF-IDF (term frequency-inverse document frequency) is a
statistical measure that evaluates how relevant a word is to a
document in a collection of documents.
• This is done by multiplying two metrics: how many times a
word appears in a document, and the inverse document
frequency of the word across a set of documents.
13. TF (Term Frequency)
• The term frequency of a word in a document: how many times
the word appears in the document, usually normalized by the
document’s length.
TF(t) = (Number of times term t appears in a
document) / (Total number of terms in the document).
14. IDF (Inverse Document Frequency)
• The inverse document frequency of a word across a set of
documents measures how common or rare the word is in the
entire document set. The closer it is to 0, the more common a
word is. This metric can be calculated by taking the total
number of documents, dividing it by the number of documents
that contain the word, and taking the logarithm of the ratio:
IDF(t) = log(Total number of documents / Number of
documents with term t in it).
• The base of the logarithm is a matter of convention; the
example below uses base 10.
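• A matching sketch for IDF, using a base-10 logarithm to stay consistent with the example that follows (the helper and the toy corpus are ours):

import math

def inverse_document_frequency(term, documents):
    """documents: a list of token lists; the term is assumed
    to occur in at least one of them."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / containing)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "slept"]]
print(inverse_document_frequency("cat", docs))  # log10(3/2) ≈ 0.18
print(inverse_document_frequency("dog", docs))  # log10(3/1) ≈ 0.48, rarer word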
15. Example
• Consider a document containing 100 words wherein the
word cat appears 3 times. The term frequency (i.e., tf)
for cat is then (3 / 100) = 0.03. Now, assume we have 10
million documents and the word cat appears in one thousand
of these. Then, the inverse document frequency (i.e., idf) is
calculated as log₁₀(10,000,000 / 1,000) = 4. Thus, the TF-IDF
weight is the product of these quantities: 0.03 × 4 = 0.12.
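• We can verify this arithmetic with a couple of lines of Python (note the base-10 logarithm):

import math

tf = 3 / 100                          # "cat" appears 3 times in 100 words
idf = math.log10(10_000_000 / 1_000)  # base-10 log, as in the example
print(tf, idf, tf * idf)              # 0.03 4.0 0.12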
16. Cosine similarity
• Cosine similarity is a good choice whenever we need to
quantify the similarity between pairs of documents.
• To quantify the similarity between two documents, we
first convert their words into a vectorized form of
representation.
• The vector representations of the documents can then be
used in the cosine similarity formula to obtain a
similarity score.
17. • A cosine similarity of 1 implies that the two documents
are exactly alike, while a cosine similarity of 0 indicates
that there are no similarities between the two documents.
18. Example
• Here’s an example:
• Document 1: Deep Learning can be hard
• Document 2: Deep Learning can be simple
• Step 1: First we obtain a vectorized representation of
the texts:
19. Word counts per document (vocabulary: deep, learning, can, be, hard, simple):
word      Doc 1  Doc 2
deep      1      1
learning  1      1
can       1      1
be        1      1
hard      1      0
simple    0      1
20. • Document 1: [1, 1, 1, 1, 1, 0] let’s refer to this as A
• Document 2: [1, 1, 1, 1, 0, 1] let’s refer to this as B
• Above we have two vectors (A and B) that live in a
6-dimensional vector space.
• Step 2: Find the cosine similarity
21. • cosine similarity (CS) = (A · B) / (||A|| ||B||)
• Calculate the dot product between A and B: 1·1 + 1·1 +
1·1 + 1·1 + 1·0 + 0·1 = 4
• Calculate the magnitude of vector A: √(1² + 1² + 1² +
1² + 1² + 0²) = √5 ≈ 2.2360679775
• Calculate the magnitude of vector B: √(1² + 1² + 1² +
1² + 0² + 1²) = √5 ≈ 2.2360679775
• Calculate the cosine similarity: 4 /
(2.2360679775 × 2.2360679775) = 0.80 (80% similarity
between the two documents)
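• The same computation in Python, reproducing the numbers above:

import math

a = [1, 1, 1, 1, 1, 0]  # Document 1: "Deep Learning can be hard"
b = [1, 1, 1, 1, 0, 1]  # Document 2: "Deep Learning can be simple"

dot = sum(x * y for x, y in zip(a, b))     # 4
norm_a = math.sqrt(sum(x * x for x in a))  # √5 ≈ 2.2360679775
norm_b = math.sqrt(sum(x * x for x in b))  # √5 ≈ 2.2360679775
print(dot / (norm_a * norm_b))             # ≈ 0.80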
22. Jaccard Similarity
• Jaccard Similarity is also known as the Jaccard
index and Intersection over Union. The Jaccard
Similarity metric is used to determine how close two
text documents are to each other in terms of their
content, that is, how many words they have in common
out of the total set of words.
23. • The mathematical representation of the Jaccard Similarity is:
J(A, B) = |A ∩ B| / |A ∪ B|
• The Jaccard Similarity score is in a range of 0 to 1. If the two
documents are identical, the Jaccard Similarity is 1. The Jaccard
Similarity score is 0 if there are no common words between the
two documents.
24. Example
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"
words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}
words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
25. • Now, we will calculate the intersection and union of these
two sets of words and measure the Jaccard
Similarity between doc_1 and doc_2.
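• A short Python sketch of this computation, using the sets above:

words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}
words_doc2 = {'data', 'is', 'a', 'new', 'oil'}

intersection = words_doc1 & words_doc2  # {'data', 'is', 'new', 'oil'}
union = words_doc1 | words_doc2         # 9 distinct words in total

print(len(intersection) / len(union))   # 4 / 9 ≈ 0.444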