SlideShare a Scribd company logo
1 of 16
Download to read offline
1
Text Similarity
Class Algorithmic Methods of Data Mining
Program M. Sc. Data Science
University Sapienza University of Rome
Semester Fall 2015
Lecturer Carlos Castillo http://chato.cl/
Sources:
● Christopher D. Manning, Prabhakar Raghavan and Hinrich
Schütze, Introduction to Information Retrieval, Cambridge
University Press. 2008. Sections 6.2, 6.3.
2
Why are these similar?
Why are these different?
3
Why are these similar?
Why are these different?
4
Various levels of text similarity
● String distance (e.g. edit distance)
● Lexical overlap
● Vector space model
– A simple model of text similarity,
introduced by Salton et al. in 1975
● Usage of semantic resources
● Automatic reasoning/understanding
● AI-complete text similarity
5
Running example
● Q: "gold silver truck"
● D1: "Shipment of gold damaged in a fire"
● D2: "Delivery of silver arrived in a silver truck"
● D3: "Shipment of gold arrived in a truck"
Which document is more similar to Q?
6
Bag of Words Model: Binary vectors
● First, normalize (in this case, lowercase)
● Second, compute vocabulary and sort
– a arrived damaged delivery fire gold in of shipment
silver truck
a arrived damag
ed
delivery fire gold in of shipment silver truck
Shipment of
gold
damaged in
a fire
1 0 1 0 1 1 1 1 1 0 0
Shipment of
gold arrived
in a truck
1 1 0 0 0 1 1 1 1 0 1
7
Distance
● Similarity between D1 and D2
– <v1,v2> = 5
What are the shortcomings of this method?
a arrived damag
ed
delivery fire gold in of shipment silver truck
D1.Shipment
of gold
damaged in a
fire
1 0 1 0 1 1 1 1 1 0 0
D2. Shipment
of gold arrived
in a truck
1 1 0 0 0 1 1 1 1 0 1
8
How important is a term?
● Common terms such as “a”,”in”,”of” don't say
much
● Rare terms such as “gold”,”truck” say more
● Document Frequency of term t
– Number of documents containing a term
– DF(t)
● Inverse document frequency of term t
–
9
Example
● D1: "Shipment of gold damaged in a fire"
● D2: "Delivery of silver arrived in a silver truck"
● D3: "Shipment of gold arrived in a truck"
● |D| = 3
● IDF(“gold”) = ?
● IDF(“a”) = ?
● IDF(“silver”) = ?
10
Example
● D1: "Shipment of gold damaged in a fire"
● D2: "Delivery of silver arrived in a silver truck"
● D3: "Shipment of gold arrived in a truck"
● |D| = 3
● IDF(“gold”) = log( 3 / 2 ) = 0.176 (using log10)
● IDF(“a”) = log( 3 / 3 ) = 0.000
● IDF(“silver”) = log( 3 / 1 ) = 0.477
11
Term frequency
● Term frequency(doc,term) = TF(doc,term)
– Number of times the term appears in a document
● If a document contains many occurrences of a
word, that word is important for the document
TF("Delivery of silver arrived in a silver truck”, “silver”) = 2
TF(“Delivery of silver arrived in a silver truck”,“arrived”) = 1
12
Document vectors
● Di,j corresponds to document i, term j
● Di,j = TF(i,j) x IDF(j)
Exercise:
Write the document vectors for all 3 documents
● D1: "Shipment of gold damaged in a fire"
● D2: "Delivery of silver arrived in a silver truck"
● D3: "Shipment of gold arrived in a truck"
Verify: D1,3 = 0.477; D2,10=0.954
Answer:
http://chato.cl/2015/data_analysis/exercise-answers/textdistance_exercise_01_answer.pd
f
13
Computing similarity
Image: https://moz.com/blog/lda-and-googles-rankings-well-correlated
● Each document is a
vector in the positive
quadrant
● The cosine of the angle
between vectors is their
similarity
14
What is the best document?
Exercise:
● Write the TF·IDF vector for Q
● Compute D1·Q, D2·Q, D3·Q (do not normalize)
● Verify you got these numbers (in a different ordering): { 0.062,
0.031, 0.486 }
● What is the best document?
Answer:
http://chato.cl/2015/data_analysis/exercise-answers/textdistance_exercise_01_ans
wer.pdf
● Q = “gold silver truck”
15
Pros/Cons
● We are losing information
– Sentence structure, proximity of words, ordering of the words
● How could we keep this?
– Capitalization and everything we lost during normalization
● How could we keep this?
● But
– It's really fast
– It works in practice
– It can be extended, e.g. different weighting schemes
16
Weighting
schemes
Most documents have some
structure
This structure allows us to do
something better than TF
What would you do?

More Related Content

What's hot

Introduction to Computer theory Daniel Cohen Chapter 4 & 5 Solutions
Introduction to Computer theory Daniel Cohen Chapter 4 & 5 SolutionsIntroduction to Computer theory Daniel Cohen Chapter 4 & 5 Solutions
Introduction to Computer theory Daniel Cohen Chapter 4 & 5 SolutionsAshu
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterDatabricks
 
Introduction to Computer theory Daniel Cohen Chapter 2 Solutions
Introduction to Computer theory Daniel Cohen Chapter 2 SolutionsIntroduction to Computer theory Daniel Cohen Chapter 2 Solutions
Introduction to Computer theory Daniel Cohen Chapter 2 SolutionsAshu
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
Data Mining Case Study
Data Mining Case StudyData Mining Case Study
Data Mining Case StudyXiaomeng Chai
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureJoey Bolduc-Gilbert
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge GraphsJeff Z. Pan
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Machine Learning in 5 Minutes— Classification
Machine Learning in 5 Minutes— ClassificationMachine Learning in 5 Minutes— Classification
Machine Learning in 5 Minutes— ClassificationBrian Lange
 
Streaming Graph Processing and Analytics
Streaming Graph Processing and AnalyticsStreaming Graph Processing and Analytics
Streaming Graph Processing and AnalyticsM. Tamer Özsu
 
Introduction to Data Stream Processing
Introduction to Data Stream ProcessingIntroduction to Data Stream Processing
Introduction to Data Stream ProcessingSafe Software
 
Data analytics and visualization
Data analytics and visualizationData analytics and visualization
Data analytics and visualizationVini Vasundharan
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 

What's hot (20)

Introduction to Computer theory Daniel Cohen Chapter 4 & 5 Solutions
Introduction to Computer theory Daniel Cohen Chapter 4 & 5 SolutionsIntroduction to Computer theory Daniel Cohen Chapter 4 & 5 Solutions
Introduction to Computer theory Daniel Cohen Chapter 4 & 5 Solutions
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
Introduction to Computer theory Daniel Cohen Chapter 2 Solutions
Introduction to Computer theory Daniel Cohen Chapter 2 SolutionsIntroduction to Computer theory Daniel Cohen Chapter 2 Solutions
Introduction to Computer theory Daniel Cohen Chapter 2 Solutions
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Mining Case Study
Data Mining Case StudyData Mining Case Study
Data Mining Case Study
 
Temporal logic-model-checking
Temporal logic-model-checkingTemporal logic-model-checking
Temporal logic-model-checking
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
Machine Learning in 5 Minutes— Classification
Machine Learning in 5 Minutes— ClassificationMachine Learning in 5 Minutes— Classification
Machine Learning in 5 Minutes— Classification
 
Shortest path
Shortest pathShortest path
Shortest path
 
Streaming Graph Processing and Analytics
Streaming Graph Processing and AnalyticsStreaming Graph Processing and Analytics
Streaming Graph Processing and Analytics
 
Introduction to Data Stream Processing
Introduction to Data Stream ProcessingIntroduction to Data Stream Processing
Introduction to Data Stream Processing
 
Data analytics and visualization
Data analytics and visualizationData analytics and visualization
Data analytics and visualization
 
03 Machine Learning Linear Algebra
03 Machine Learning Linear Algebra03 Machine Learning Linear Algebra
03 Machine Learning Linear Algebra
 
The CAP Theorem
The CAP Theorem The CAP Theorem
The CAP Theorem
 
Data visualization
Data visualizationData visualization
Data visualization
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 

More from Carlos Castillo (ChaTo)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social MediaCarlos Castillo (ChaTo)
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Carlos Castillo (ChaTo)
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Carlos Castillo (ChaTo)
 

More from Carlos Castillo (ChaTo) (20)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
 
Link prediction
Link predictionLink prediction
Link prediction
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Indexing
IndexingIndexing
Indexing
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 

Recently uploaded

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 

Recently uploaded (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

Text similarity and the vector space model

  • 1. 1 Text Similarity Class Algorithmic Methods of Data Mining Program M. Sc. Data Science University Sapienza University of Rome Semester Fall 2015 Lecturer Carlos Castillo http://chato.cl/ Sources: ● Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. Sections 6.2, 6.3.
  • 2. 2 Why are these similar? Why are these different?
  • 3. 3 Why are these similar? Why are these different?
  • 4. 4 Various levels of text similarity ● String distance (e.g. edit distance) ● Lexical overlap ● Vector space model – A simple model of text similarity, introduced by Salton et al. in 1975 ● Usage of semantic resources ● Automatic reasoning/understanding ● AI-complete text similarity
  • 5. 5 Running example ● Q: "gold silver truck" ● D1: "Shipment of gold damaged in a fire" ● D2: "Delivery of silver arrived in a silver truck" ● D3: "Shipment of gold arrived in a truck" Which document is more similar to Q?
  • 6. 6 Bag of Words Model: Binary vectors ● First, normalize (in this case, lowercase) ● Second, compute vocabulary and sort – a arrived damaged delivery fire gold in of shipment silver truck a arrived damag ed delivery fire gold in of shipment silver truck Shipment of gold damaged in a fire 1 0 1 0 1 1 1 1 1 0 0 Shipment of gold arrived in a truck 1 1 0 0 0 1 1 1 1 0 1
  • 7. 7 Distance ● Similarity between D1 and D2 – <v1,v2> = 5 What are the shortcomings of this method? a arrived damag ed delivery fire gold in of shipment silver truck D1.Shipment of gold damaged in a fire 1 0 1 0 1 1 1 1 1 0 0 D2. Shipment of gold arrived in a truck 1 1 0 0 0 1 1 1 1 0 1
  • 8. 8 How important is a term? ● Common terms such as “a”,”in”,”of” don't say much ● Rare terms such as “gold”,”truck” say more ● Document Frequency of term t – Number of documents containing a term – DF(t) ● Inverse document frequency of term t –
  • 9. 9 Example ● D1: "Shipment of gold damaged in a fire" ● D2: "Delivery of silver arrived in a silver truck" ● D3: "Shipment of gold arrived in a truck" ● |D| = 3 ● IDF(“gold”) = ? ● IDF(“a”) = ? ● IDF(“silver”) = ?
  • 10. 10 Example ● D1: "Shipment of gold damaged in a fire" ● D2: "Delivery of silver arrived in a silver truck" ● D3: "Shipment of gold arrived in a truck" ● |D| = 3 ● IDF(“gold”) = log( 3 / 2 ) = 0.176 (using log10) ● IDF(“a”) = log( 3 / 3 ) = 0.000 ● IDF(“silver”) = log( 3 / 1 ) = 0.477
  • 11. 11 Term frequency ● Term frequency(doc,term) = TF(doc,term) – Number of times the term appears in a document ● If a document contains many occurrences of a word, that word is important for the document TF("Delivery of silver arrived in a silver truck”, “silver”) = 2 TF(“Delivery of silver arrived in a silver truck”,“arrived”) = 1
  • 12. 12 Document vectors ● Di,j corresponds to document i, term j ● Di,j = TF(i,j) x IDF(j) Exercise: Write the document vectors for all 3 documents ● D1: "Shipment of gold damaged in a fire" ● D2: "Delivery of silver arrived in a silver truck" ● D3: "Shipment of gold arrived in a truck" Verify: D1,3 = 0.477; D2,10=0.954 Answer: http://chato.cl/2015/data_analysis/exercise-answers/textdistance_exercise_01_answer.pd f
  • 13. 13 Computing similarity Image: https://moz.com/blog/lda-and-googles-rankings-well-correlated ● Each document is a vector in the positive quadrant ● The cosine of the angle between vectors is their similarity
  • 14. 14 What is the best document? Exercise: ● Write the TF·IDF vector for Q ● Compute D1·Q, D2·Q, D3·Q (do not normalize) ● Verify you got these numbers (in a different ordering): { 0.062, 0.031, 0.486 } ● What is the best document? Answer: http://chato.cl/2015/data_analysis/exercise-answers/textdistance_exercise_01_ans wer.pdf ● Q = “gold silver truck”
  • 15. 15 Pros/Cons ● We are losing information – Sentence structure, proximity of words, ordering of the words ● How could we keep this? – Capitalization and everything we lost during normalization ● How could we keep this? ● But – It's really fast – It works in practice – It can be extended, e.g. different weighting schemes
  • 16. 16 Weighting schemes Most documents have some structure This structure allows us to do something better than TF What would you do?