Calculating Similarity between a Document and a Query
Contents
Text Similarity
Document Vectors
Word Embeddings
TF-IDF
Cosine Similarity
Jaccard Similarity
How to Compute the Similarity Between Two Text Documents
• Computing the similarity between two text documents
is a common task in NLP and information retrieval, with
several practical applications: for example, ranking results
in a search engine or recommending similar content to readers.
Text Similarity
• Our first step is to define what we mean by similarity.
We’ll do this by starting with two examples.
Let’s consider the sentences:
• The teacher gave his speech to an empty room
• There was almost nobody when the professor was
talking.
• Although they convey a very similar meaning, they are
written in completely different ways. In fact, the two
sentences have just one word in common (“the”), and
not a particularly significant one at that.
Document Vectors
• The traditional approach to computing text similarity
between documents is to transform the input documents
into real-valued vectors. The goal is a vector space where
similar documents are “close”, according to a chosen
similarity measure.
• This approach is known as the Vector Space Model,
and it’s very convenient because it allows us to use
simple linear algebra to compute similarities. We just
have to define two things:
• A way of transforming documents into vectors
• A similarity measure for vectors.
• So, let’s see the possible ways of transforming a text
document into a vector.
Document Vectors: an Example
Let’s consider three sentences:
• We went to the pizza place and you ate no pizza at all
• I ate pizza with you yesterday at home
• There’s no place like home
• To build our vectors, we’ll count the occurrences of each word
in a sentence, as in the sketch below:
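A minimal Python sketch of this counting step (the variable names are illustrative, not from the original slides):

from collections import Counter

sentences = [
    "We went to the pizza place and you ate no pizza at all",
    "I ate pizza with you yesterday at home",
    "There's no place like home",
]

# Build a shared vocabulary across all sentences
tokenized = [s.lower().split() for s in sentences]
vocab = sorted({word for tokens in tokenized for word in tokens})

# One count vector per sentence, aligned to the vocabulary
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocab])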
• Once we have our vectors, we can use
the standard similarity measure for this
situation: cosine similarity. Cosine
similarity measures the cosine of the angle between
the two vectors and returns a real value between -1 and 1.
• If the vectors only have positive values, like in
our case, the output will actually lie between 0
and 1. It will return 0 when the two vectors are
orthogonal, that is, the documents have no words
in common, and 1 when the two vectors are
parallel, that is, the documents have identical
word distributions.
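Continuing the sketch above, cosine similarity can be computed directly from those count vectors in a few lines of pure Python:

import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Compare the first pizza sentence with the other two
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))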
Word Embeddings
• Word embeddings are high-dimensional vectors
that represent words. We can create them in an
unsupervised way from a collection of
documents, generally using neural networks, by
analyzing all the contexts in which each word
occurs.
• The training produces similar vectors (according
to cosine similarity) for words that appear in
similar contexts, and thus have a similar meaning.
• For example, since the words “teacher” and
“professor” can sometimes be
used interchangeably, their embeddings will be
close together.
• For this reason, word embeddings enable us
to handle synonyms and words with similar
meanings when computing similarity, which we
couldn’t do using word frequencies alone.
• However, word embeddings are just vector
representations of individual words, and there
are several ways to integrate them into a text
similarity computation. One basic approach is to
average the embeddings of the words in each
document, as sketched below.
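A minimal sketch of that averaging step; the 3-dimensional embeddings here are made up purely for illustration (real ones would come from a trained model such as word2vec):

# Toy 3-dimensional embeddings, invented for illustration only
embeddings = {
    "teacher":   [0.9, 0.1, 0.3],
    "professor": [0.8, 0.2, 0.3],
    "speech":    [0.1, 0.9, 0.5],
    "talking":   [0.2, 0.8, 0.6],
}

def document_vector(words):
    # Average the embeddings of the words we have vectors for
    vecs = [embeddings[w] for w in words if w in embeddings]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

doc_a = document_vector(["teacher", "speech"])
doc_b = document_vector(["professor", "talking"])
# doc_a and doc_b can now be compared with cosine similarity,
# exactly as with the count vectors above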
TF-IDF
• TF-IDF (term frequency-inverse document frequency) is a
statistical measure that evaluates how relevant a word is to a
document in a collection of documents.
• This is done by multiplying two metrics: how many times a
word appears in a document, and the inverse document
frequency of the word across a set of documents.
TF (Term Frequency)
• The term frequency of a word in a document: a raw count of
how many times the word appears, usually normalized by the
document length.
TF(t) = (Number of times term t appears in the
document) / (Total number of terms in the document)
IDF (Inverse Document Frequency)
• The inverse document frequency of the word across a set of
documents measures how common or rare the word is in the
entire document set. The closer it is to 0, the more common the
word. It is calculated by taking the total number of documents,
dividing it by the number of documents that contain the word,
and taking the logarithm:
IDF(t) = log(Total number of documents / Number of
documents containing term t)
Example
• Consider a document containing 100 words in which the
word cat appears 3 times. The term frequency (TF)
for cat is then 3 / 100 = 0.03. Now, assume we have 10
million documents and the word cat appears in one thousand
of these. The inverse document frequency (IDF) is then
log₁₀(10,000,000 / 1,000) = log₁₀(10,000) = 4. Thus, the TF-IDF
weight is the product of these quantities: 0.03 × 4 = 0.12.
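As a quick check, the same arithmetic in Python (using the base-10 logarithm, as in the example):

import math

tf = 3 / 100                           # term frequency of "cat"
idf = math.log10(10_000_000 / 1_000)   # inverse document frequency = 4.0
print(tf * idf)                        # TF-IDF weight: 0.12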
Cosine Similarity
• Cosine similarity is a good choice whenever we need to
quantify how similar a pair of documents is.
• To quantify the similarity between two documents, we
first convert their words into a vector representation.
• The vector representations of the documents are then
plugged into the cosine similarity formula to obtain a
similarity score.
• A cosine similarity of 1 implies that the two documents
are exactly alike, while a cosine similarity of 0 indicates
that the two documents have no similarity.
Example
• Here’s an example:
• Document 1: Deep Learning can be hard
• Document 2: Deep Learning can be simple
• Step 1: First we obtain a vectorized representation of
the texts, counting each word of the vocabulary
[deep, learning, can, be, hard, simple]:
• Document 1: [1, 1, 1, 1, 1, 0], let’s refer to this as A
• Document 2: [1, 1, 1, 1, 0, 1], let’s refer to this as B
• Above we have two vectors (A and B) in a
6-dimensional vector space
• Step 2: Find the cosine similarity
• cosine similarity (CS) = (A · B) / (||A|| ||B||)
• Calculate the dot product of A and B: 1·1 + 1·1 +
1·1 + 1·1 + 1·0 + 0·1 = 4
• Calculate the magnitude of vector A: √(1² + 1² + 1² +
1² + 1² + 0²) = √5 ≈ 2.236
• Calculate the magnitude of vector B: √(1² + 1² + 1² +
1² + 0² + 1²) = √5 ≈ 2.236
• Calculate the cosine similarity: 4 /
(2.236 × 2.236) = 4 / 5 = 0.80 (80% similarity
between the two documents)
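In practice, the vectorization and similarity steps are usually delegated to a library. A minimal sketch with scikit-learn (assuming it is installed) reproduces the result above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Deep Learning can be hard", "Deep Learning can be simple"]

# Count vectors for both documents over a shared vocabulary
X = CountVectorizer().fit_transform(docs)

# cosine_similarity returns a 2x2 matrix; the off-diagonal entry
# is the similarity between the two documents
print(cosine_similarity(X)[0, 1])  # 0.8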
Jaccard Similarity
• Jaccard similarity is also known as the Jaccard
index or Intersection over Union. The Jaccard
similarity metric is used to determine how close two
text documents are in terms of their content, that is,
how many words they have in common out of the total
number of distinct words.
• The mathematical representation of the Jaccard similarity is:
J(A, B) = |A ∩ B| / |A ∪ B|
• The Jaccard similarity score ranges from 0 to 1. If the two
documents are identical, the Jaccard similarity is 1; if there
are no common words between the two documents, the
score is 0.
Example
doc_1 = "Data is the new oil of the digital economy“
doc_2 = "Data is a new oil"
words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy’}
words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
• Now we calculate the intersection and union of these
two sets of words and measure the Jaccard similarity
between doc_1 and doc_2, as sketched below.
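A minimal sketch of that computation in Python, continuing from the two word sets above:

intersection = words_doc1 & words_doc2   # {'data', 'is', 'new', 'oil'}
union = words_doc1 | words_doc2          # 9 distinct words in total
print(len(intersection) / len(union))    # 4 / 9 ≈ 0.44

The two documents share 4 of the 9 distinct words, giving a Jaccard similarity of 4/9 ≈ 0.44.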