SlideShare a Scribd company logo
1
Text Similarity
Class Algorithmic Methods of Data Mining
Program M. Sc. Data Science
University Sapienza University of Rome
Semester Fall 2015
Lecturer Carlos Castillo http://chato.cl/
Sources:
● Christopher D. Manning, Prabhakar Raghavan and Hinrich
Schütze, Introduction to Information Retrieval, Cambridge
University Press. 2008. Sections 6.2, 6.3.
2
Why are these similar?
Why are these different?
3
Why are these similar?
Why are these different?
4
Various levels of text similarity
● String distance (e.g. edit distance)
● Lexical overlap
● Vector space model
– A simple model of text similarity,
introduced by Salton et al. in 1975
● Usage of semantic resources
● Automatic reasoning/understanding
● AI-complete text similarity
5
Running example
● Q: "gold silver truck"
● D1: "Shipment of gold damaged in a fire"
● D2: "Delivery of silver arrived in a silver truck"
● D3: "Shipment of gold arrived in a truck"
Which document is more similar to Q?
6
Bag of Words Model: Binary vectors
● First, normalize (in this case, lowercase)
● Second, compute vocabulary and sort
– a arrived damaged delivery fire gold in of shipment
silver truck
a arrived damag
ed
delivery fire gold in of shipment silver truck
Shipment of
gold
damaged in
a fire
1 0 1 0 1 1 1 1 1 0 0
Shipment of
gold arrived
in a truck
1 1 0 0 0 1 1 1 1 0 1
7
Distance
● Similarity between D1 and D2
– <v1,v2> = 5
What are the shortcomings of this method?
a arrived damag
ed
delivery fire gold in of shipment silver truck
D1.Shipment
of gold
damaged in a
fire
1 0 1 0 1 1 1 1 1 0 0
D2. Shipment
of gold arrived
in a truck
1 1 0 0 0 1 1 1 1 0 1
8
How important is a term?
● Common terms such as “a”,”in”,”of” don't say
much
● Rare terms such as “gold”,”truck” say more
● Document Frequency of term t
– Number of documents containing a term
– DF(t)
● Inverse document frequency of term t
–
9
Example
● D1: "Shipment of gold damaged in a fire"
● D2: "Delivery of silver arrived in a silver truck"
● D3: "Shipment of gold arrived in a truck"
● |D| = 3
● IDF(“gold”) = ?
● IDF(“a”) = ?
● IDF(“silver”) = ?
10
Example
● D1: "Shipment of gold damaged in a fire"
● D2: "Delivery of silver arrived in a silver truck"
● D3: "Shipment of gold arrived in a truck"
● |D| = 3
● IDF(“gold”) = log( 3 / 2 ) = 0.176 (using log10)
● IDF(“a”) = log( 3 / 3 ) = 0.000
● IDF(“silver”) = log( 3 / 1 ) = 0.477
11
Term frequency
● Term frequency(doc,term) = TF(doc,term)
– Number of times the term appears in a document
● If a document contains many occurrences of a
word, that word is important for the document
TF("Delivery of silver arrived in a silver truck”, “silver”) = 2
TF(“Delivery of silver arrived in a silver truck”,“arrived”) = 1
12
Document vectors
● Di,j corresponds to document i, term j
● Di,j = TF(i,j) x IDF(j)
Exercise:
Write the document vectors for all 3 documents
● D1: "Shipment of gold damaged in a fire"
● D2: "Delivery of silver arrived in a silver truck"
● D3: "Shipment of gold arrived in a truck"
Verify: D1,3 = 0.477; D2,10=0.954
Answer:
http://chato.cl/2015/data_analysis/exercise-answers/textdistance_exercise_01_answer.pd
f
13
Computing similarity
Image: https://moz.com/blog/lda-and-googles-rankings-well-correlated
● Each document is a
vector in the positive
quadrant
● The cosine of the angle
between vectors is their
similarity
14
What is the best document?
Exercise:
● Write the TF·IDF vector for Q
● Compute D1·Q, D2·Q, D3·Q (do not normalize)
● Verify you got these numbers (in a different ordering): { 0.062,
0.031, 0.486 }
● What is the best document?
Answer:
http://chato.cl/2015/data_analysis/exercise-answers/textdistance_exercise_01_ans
wer.pdf
● Q = “gold silver truck”
15
Pros/Cons
● We are losing information
– Sentence structure, proximity of words, ordering of the words
● How could we keep this?
– Capitalization and everything we lost during normalization
● How could we keep this?
● But
– It's really fast
– It works in practice
– It can be extended, e.g. different weighting schemes
16
Weighting
schemes
Most documents have some
structure
This structure allows us to do
something better than TF
What would you do?

More Related Content

What's hot

Vector space model in information retrieval
Vector space model in information retrievalVector space model in information retrieval
Vector space model in information retrieval
Tharuka Vishwajith Sarathchandra
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
Datamining Tools
 
Regular expressions-Theory of computation
Regular expressions-Theory of computationRegular expressions-Theory of computation
Regular expressions-Theory of computation
Bipul Roy Bpl
 
Servlet life cycle
Servlet life cycleServlet life cycle
Servlet life cycle
Venkateswara Rao N
 
Text mining Pre-processing
Text mining Pre-processingText mining Pre-processing
Text mining Pre-processing
Creditas
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdf
Nancykumari47
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design
MAHASREEM
 
Keccak
KeccakKeccak
Keccak
Rajeev Verma
 
Nlp ambiguity presentation
Nlp ambiguity presentationNlp ambiguity presentation
Nlp ambiguity presentation
Gurram Poorna Prudhvi
 
Chapter 5 Syntax Directed Translation
Chapter 5   Syntax Directed TranslationChapter 5   Syntax Directed Translation
Chapter 5 Syntax Directed Translation
Radhakrishnan Chinnusamy
 
Compiler Design LR parsing SLR ,LALR CLR
Compiler Design LR parsing SLR ,LALR CLRCompiler Design LR parsing SLR ,LALR CLR
Compiler Design LR parsing SLR ,LALR CLR
Riazul Islam
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
Anandh Arumugakan
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
Akshaya Arunan
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
Gangadhar S
 
Query optimization
Query optimizationQuery optimization
Query optimization
Zunera Bukhari
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval ssilambu111
 
Syntax directed translation
Syntax directed translationSyntax directed translation
Syntax directed translation
Akshaya Arunan
 
Information and network security 13 playfair cipher
Information and network security 13 playfair cipherInformation and network security 13 playfair cipher
Information and network security 13 playfair cipher
Vaibhav Khanna
 

What's hot (20)

Vector space model in information retrieval
Vector space model in information retrievalVector space model in information retrieval
Vector space model in information retrieval
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Regular expressions-Theory of computation
Regular expressions-Theory of computationRegular expressions-Theory of computation
Regular expressions-Theory of computation
 
Servlet life cycle
Servlet life cycleServlet life cycle
Servlet life cycle
 
Text mining Pre-processing
Text mining Pre-processingText mining Pre-processing
Text mining Pre-processing
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdf
 
Wordnet Introduction
Wordnet IntroductionWordnet Introduction
Wordnet Introduction
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design
 
Keccak
KeccakKeccak
Keccak
 
Nlp ambiguity presentation
Nlp ambiguity presentationNlp ambiguity presentation
Nlp ambiguity presentation
 
Chapter 5 Syntax Directed Translation
Chapter 5   Syntax Directed TranslationChapter 5   Syntax Directed Translation
Chapter 5 Syntax Directed Translation
 
Compiler Design LR parsing SLR ,LALR CLR
Compiler Design LR parsing SLR ,LALR CLRCompiler Design LR parsing SLR ,LALR CLR
Compiler Design LR parsing SLR ,LALR CLR
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Query optimization
Query optimizationQuery optimization
Query optimization
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
 
Syntax directed translation
Syntax directed translationSyntax directed translation
Syntax directed translation
 
Information and network security 13 playfair cipher
Information and network security 13 playfair cipherInformation and network security 13 playfair cipher
Information and network security 13 playfair cipher
 

More from Carlos Castillo (ChaTo)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
Carlos Castillo (ChaTo)
 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
Carlos Castillo (ChaTo)
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Carlos Castillo (ChaTo)
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
Carlos Castillo (ChaTo)
 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
Carlos Castillo (ChaTo)
 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
Carlos Castillo (ChaTo)
 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
Carlos Castillo (ChaTo)
 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
Carlos Castillo (ChaTo)
 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
Carlos Castillo (ChaTo)
 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
Carlos Castillo (ChaTo)
 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
Carlos Castillo (ChaTo)
 
Link prediction
Link predictionLink prediction
Link prediction
Carlos Castillo (ChaTo)
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
Carlos Castillo (ChaTo)
 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
Carlos Castillo (ChaTo)
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
Carlos Castillo (ChaTo)
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
Carlos Castillo (ChaTo)
 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
Carlos Castillo (ChaTo)
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
Carlos Castillo (ChaTo)
 
Indexing
IndexingIndexing
Text Summarization
Text SummarizationText Summarization
Text Summarization
Carlos Castillo (ChaTo)
 

More from Carlos Castillo (ChaTo) (20)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
 
Link prediction
Link predictionLink prediction
Link prediction
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Indexing
IndexingIndexing
Indexing
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 

Recently uploaded

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 

Recently uploaded (20)

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

Text similarity and the vector space model

  • 1. 1 Text Similarity Class Algorithmic Methods of Data Mining Program M. Sc. Data Science University Sapienza University of Rome Semester Fall 2015 Lecturer Carlos Castillo http://chato.cl/ Sources: ● Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. Sections 6.2, 6.3.
  • 2. 2 Why are these similar? Why are these different?
  • 3. 3 Why are these similar? Why are these different?
  • 4. 4 Various levels of text similarity ● String distance (e.g. edit distance) ● Lexical overlap ● Vector space model – A simple model of text similarity, introduced by Salton et al. in 1975 ● Usage of semantic resources ● Automatic reasoning/understanding ● AI-complete text similarity
  • 5. 5 Running example ● Q: "gold silver truck" ● D1: "Shipment of gold damaged in a fire" ● D2: "Delivery of silver arrived in a silver truck" ● D3: "Shipment of gold arrived in a truck" Which document is more similar to Q?
  • 6. 6 Bag of Words Model: Binary vectors ● First, normalize (in this case, lowercase) ● Second, compute vocabulary and sort – a arrived damaged delivery fire gold in of shipment silver truck a arrived damag ed delivery fire gold in of shipment silver truck Shipment of gold damaged in a fire 1 0 1 0 1 1 1 1 1 0 0 Shipment of gold arrived in a truck 1 1 0 0 0 1 1 1 1 0 1
  • 7. 7 Distance ● Similarity between D1 and D2 – <v1,v2> = 5 What are the shortcomings of this method? a arrived damag ed delivery fire gold in of shipment silver truck D1.Shipment of gold damaged in a fire 1 0 1 0 1 1 1 1 1 0 0 D2. Shipment of gold arrived in a truck 1 1 0 0 0 1 1 1 1 0 1
  • 8. 8 How important is a term? ● Common terms such as “a”,”in”,”of” don't say much ● Rare terms such as “gold”,”truck” say more ● Document Frequency of term t – Number of documents containing a term – DF(t) ● Inverse document frequency of term t –
  • 9. 9 Example ● D1: "Shipment of gold damaged in a fire" ● D2: "Delivery of silver arrived in a silver truck" ● D3: "Shipment of gold arrived in a truck" ● |D| = 3 ● IDF(“gold”) = ? ● IDF(“a”) = ? ● IDF(“silver”) = ?
  • 10. 10 Example ● D1: "Shipment of gold damaged in a fire" ● D2: "Delivery of silver arrived in a silver truck" ● D3: "Shipment of gold arrived in a truck" ● |D| = 3 ● IDF(“gold”) = log( 3 / 2 ) = 0.176 (using log10) ● IDF(“a”) = log( 3 / 3 ) = 0.000 ● IDF(“silver”) = log( 3 / 1 ) = 0.477
  • 11. 11 Term frequency ● Term frequency(doc,term) = TF(doc,term) – Number of times the term appears in a document ● If a document contains many occurrences of a word, that word is important for the document TF("Delivery of silver arrived in a silver truck”, “silver”) = 2 TF(“Delivery of silver arrived in a silver truck”,“arrived”) = 1
  • 12. 12 Document vectors ● Di,j corresponds to document i, term j ● Di,j = TF(i,j) x IDF(j) Exercise: Write the document vectors for all 3 documents ● D1: "Shipment of gold damaged in a fire" ● D2: "Delivery of silver arrived in a silver truck" ● D3: "Shipment of gold arrived in a truck" Verify: D1,3 = 0.477; D2,10=0.954 Answer: http://chato.cl/2015/data_analysis/exercise-answers/textdistance_exercise_01_answer.pd f
  • 13. 13 Computing similarity Image: https://moz.com/blog/lda-and-googles-rankings-well-correlated ● Each document is a vector in the positive quadrant ● The cosine of the angle between vectors is their similarity
  • 14. 14 What is the best document? Exercise: ● Write the TF·IDF vector for Q ● Compute D1·Q, D2·Q, D3·Q (do not normalize) ● Verify you got these numbers (in a different ordering): { 0.062, 0.031, 0.486 } ● What is the best document? Answer: http://chato.cl/2015/data_analysis/exercise-answers/textdistance_exercise_01_ans wer.pdf ● Q = “gold silver truck”
  • 15. 15 Pros/Cons ● We are losing information – Sentence structure, proximity of words, ordering of the words ● How could we keep this? – Capitalization and everything we lost during normalization ● How could we keep this? ● But – It's really fast – It works in practice – It can be extended, e.g. different weighting schemes
  • 16. 16 Weighting schemes Most documents have some structure This structure allows us to do something better than TF What would you do?