SlideShare a Scribd company logo
1 of 44
Download to read offline
Indexing Vector Spaces 
Graphs Search Engines 
Piero Molino 
XY Lab
Relevance for 
New Publishing 
Search Engines can be considered publishers 
They select, filter, choose and sometimes create the 
content they give as result to queries 
From the user point of view, the search process is a 
blackbox, like magic 
Understanding it increases the awareness of the 
risks it has and could help bypassing or hacking it
Information Retrieval 
Let a collection of unstructured information (set of 
texts) 
Let an information need (query) 
Objective: Find items of the collection that answers 
the information need
Representation 
Simple/naive assumption 
A text can be represented by the words it contains 
Bag-of-words model
Bag of Words 
"Me and John and Mary attend this lab" 
! 
Me:1
Bag of Words 
"Me and John and Mary attend this lab" 
! 
Me:1 and:1
Bag of Words 
"Me and John and Mary attend this lab" 
! 
Me:1 and:1 John:1
Bag of Words 
"Me and John and Mary attend this lab" 
! 
Me:1 and:2 John:1
Bag of Words 
"Me and John and Mary attend this lab" 
! 
Me:1 and:2 John:1 Mary:1
Bag of Words 
"Me and John and Mary attend this lab" 
! 
Me:1 and:2 John:1 Mary:1 attend:1
Bag of Words 
"Me and John and Mary attend this lab" 
! 
Me:1 and:2 John:1 Mary:1 attend:1 this:1
Bag of Words 
"Me and John and Mary attend this lab" 
! 
Me:1 and:2 John:1 Mary:1 attend:1 this:1 lab:1
Inverted index 
To access the documents, we build an index, just like 
humans do 
Inverted Index: word-to-document 
"Indice Analitico"
Example 
doc1:"I like football" 
doc2:"John likes football" 
doc3:"John likes basketball" 
I → doc1 
like → doc1 
football → doc1 doc2 
John → doc2 doc3 
likes → doc2 doc3 
basketball → doc3
Query 
Information need represented with the same bag-of-words 
model 
Who likes basketball? → Who:1 likes:1 basketball:1
Retrieval 
Who:1 likes:1 basketball:1 
I → doc1 
like → doc1 
football → doc1 doc2 
John → doc2 doc3 
likes → doc2 doc3 
basketball → doc3 
Result: [doc2, doc3]! 
(without any order... or no?)
Boolean Model 
likes → doc2 doc3 
basketball → doc3 
doc3 contains both "likes" and "basketball", 2/3 of the input 
query terms, doc2 contains only 1/3 
Boolean Model: represents only presence or absence of 
query words and ranks document according to how many 
query words they contain 
Result: [doc3, doc2]! 
(with this precise order)
Vector Space Model 
Represents documents and words as vectors or 
points in a (multidimensional) geometrical space 
We build a term x documents matrix 
doc1 doc2 doc3 
I 1 0 0 
like 1 0 0 
football 1 1 0 
John 0 1 1 
likes 0 1 1 
basketball 0 0 1
Graphical Representation 
likes 
1 
football 
1 
0 
doc1 
doc2 
doc3
Tf-idf 
The cells of the matrix contain the frequency of a 
word in a document (term frequency tf) 
This value can be counterbalanced by the number 
of documents that contain the word (inverse 
document frequency idf) 
The final weight is called tf-idf
Tf-idf Example 1 
"Romeo" is found 700 times in the document "Romeo 
and Juliet" and also in 7 other plays 
The tf-idf of "Romeo" in the document "Romeo and 
Juliet" is 7000x(1/7)=100
Tf-idf Example 2 
"and" is found 1000 times in the document "Romeo 
and Juliet" and also in 10000 other plays 
The tf-idf of "and" in the document "Romeo and Juliet" 
is 1000x(1/10000)=0.1
Similarity 
Queries are like an additional column in matrix 
We can compute a similarity between document and 
queries
Similarity 
0 
doc1 
doc2 
doc3 
query 
Euclidean Distance
Similarity 
0 
doc1 
doc2 
doc3 
query 
Vector length normalization 
Projection of one normalized 
vector over another = cos(θ) 
θ
Cosine Similarity 
The cosine of the angle between two vectors is a 
measure of how similar two vectors are 
As the vectors represents documents and queries, 
the cosine is a measure of similarity of how similar is 
a document with respect to the query
Ranking with 
Cosine Similarity 
Shows how query and documents are correlated 
1 = maximum correlation, 0 = no correlation, -1 = 
negative correlation 
The documents obtained from the inverted index are 
ranked according to their cosine similarity wrt. the 
query
Ranking on the web 
In the web documents (webpages) are connected to 
each other by hyperlinks 
Is there a way to exploit this topology? 
PageRank algorithm (and several others)
The web as a graph 
2 
6 
3 
7 
5 
4 
1 
Webpage 
Hyperlink
Matrix Representation 
1 2 3 4 5 6 7 
1 1/2 1/2 
2 1/2 1/2 
3 1 
4 1/3 1/3 1/3 
5 1/2 1/2 
6 1/2 1/2 
7 1
Simulating Random Walks 
e = random vector, can be interpreted as how important 
is a node in the network 
Es. e = 1:0.5, 2:0.5, 3:0.5, 4:0.5, ... 
A = the matrix 
e = e · A, repeatedly is like simulating walking randomly 
on the graph 
The process converges to a vector e where each value 
represents the importance in the network
Random Walks 
2 
6 
3 
7 
5 
4 
1 
Active 
Webpage 
Hyperlink
Random Walks 
2 
6 
3 
7 
5 
4 
1 
Active 
Webpage 
Hyperlink
Random Walks 
2 
6 
3 
7 
5 
4 
1 
Active 
Webpage 
Hyperlink
Random Walks 
2 
6 
3 
7 
5 
4 
1 
Active 
Webpage 
Hyperlink
Random Walks 
2 
6 
3 
7 
5 
4 
1 
Active 
Webpage 
Hyperlink
Importance - Authority 
Webpage 
Hyperlink 
2 
6 
3 
7 
5 
4 
1 
Importance (value in e)
Ranking with PageRank 
I → doc1 
like → doc1 
football → doc1 doc2 
John → doc2 doc3 
likes → doc2 doc3 
basketball → doc3 
e 
doc1 → 0.3 
doc2 → 0.1 
doc3 → 0.5 
The inverted index gives doc2 and doc3 
Using the importance in e for ranking 
Result: [doc3, doc2] (ordered list)
Real World Ranking 
Real word search engines exploit Vector-Space- 
Model-like approaches, PageRank-like approaches 
and several others 
They balance all the different factors observing what 
webpage you click on after issuing a query and using 
them as examples for a machine learning algorithm
Machine Learning 
A machine learning algorithm works like a baby 
You give him examples of what is a dog and what is 
a cat 
He learns to look at the pointy ears, the nose and 
other aspects of the animals 
When he sees a new animal he says "cat" or "dog" 
depending on his experience
Learning to Rank 
The different indicators of how good a webpage for 
a query is (cosine similarity, PageRank, etc.) are like 
the nose, the tail and the ears of the animal 
Instead of cat and dog, the algorithm classifies 
relevant or non-relevant! 
When you issue a query and click on a result it is 
marked as relevant, otherwise it is considered non-relevant
Issues 
Among the different indicators there are a lot of 
contextual ones 
Where you issue the query from, who you are, what 
you visited before, what result you clicked on 
before, ... 
This makes the result for each person, in each place, 
in different moments in time, different
The filter bubble 
A lot of diverse content is filtered from you 
The search engine shows you what it thinks you will 
be interested on, based on all this contextual factors 
Lack of transparency of the search process 
A way out of the bubble? Maybe favoring serendipity
Thank you 
for your attention

More Related Content

Similar to Indexing vector spaces graph search engines

Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Deepak K
 
The well tempered search application
The well tempered search applicationThe well tempered search application
The well tempered search applicationTed Sullivan
 
Researching on the Internet: Helpful hints/safety tips
Researching on the Internet: Helpful hints/safety tipsResearching on the Internet: Helpful hints/safety tips
Researching on the Internet: Helpful hints/safety tipsJennEpps
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spacesMounia Lalmas-Roelleke
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Vivian S. Zhang
 
learningIntro.doc
learningIntro.doclearningIntro.doc
learningIntro.docbutest
 
learningIntro.doc
learningIntro.doclearningIntro.doc
learningIntro.docbutest
 
How the Web can change social science research (including yours)
How the Web can change social science research (including yours)How the Web can change social science research (including yours)
How the Web can change social science research (including yours)Frank van Harmelen
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
A recommendation engine for your applications - M.Orselli - Codemotion Rome 17
A recommendation engine for your applications - M.Orselli - Codemotion Rome 17A recommendation engine for your applications - M.Orselli - Codemotion Rome 17
A recommendation engine for your applications - M.Orselli - Codemotion Rome 17Codemotion
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
From Natural Language Processing to Artificial Intelligence
From Natural Language Processing to Artificial IntelligenceFrom Natural Language Processing to Artificial Intelligence
From Natural Language Processing to Artificial IntelligenceJonathan Mugan
 
Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?jeansoulin
 
Data Day Seattle, From NLP to AI
Data Day Seattle, From NLP to AIData Day Seattle, From NLP to AI
Data Day Seattle, From NLP to AIJonathan Mugan
 

Similar to Indexing vector spaces graph search engines (20)

Eacl 2006 Pedersen
Eacl 2006 PedersenEacl 2006 Pedersen
Eacl 2006 Pedersen
 
Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.
 
Eurolan 2005 Pedersen
Eurolan 2005 PedersenEurolan 2005 Pedersen
Eurolan 2005 Pedersen
 
Vivo Search
Vivo SearchVivo Search
Vivo Search
 
The well tempered search application
The well tempered search applicationThe well tempered search application
The well tempered search application
 
Researching on the Internet: Helpful hints/safety tips
Researching on the Internet: Helpful hints/safety tipsResearching on the Internet: Helpful hints/safety tips
Researching on the Internet: Helpful hints/safety tips
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spaces
 
Phpconf2008 Sphinx En
Phpconf2008 Sphinx EnPhpconf2008 Sphinx En
Phpconf2008 Sphinx En
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
learningIntro.doc
learningIntro.doclearningIntro.doc
learningIntro.doc
 
learningIntro.doc
learningIntro.doclearningIntro.doc
learningIntro.doc
 
How the Web can change social science research (including yours)
How the Web can change social science research (including yours)How the Web can change social science research (including yours)
How the Web can change social science research (including yours)
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
A recommendation engine for your applications - M.Orselli - Codemotion Rome 17
A recommendation engine for your applications - M.Orselli - Codemotion Rome 17A recommendation engine for your applications - M.Orselli - Codemotion Rome 17
A recommendation engine for your applications - M.Orselli - Codemotion Rome 17
 
Week12
Week12Week12
Week12
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
From Natural Language Processing to Artificial Intelligence
From Natural Language Processing to Artificial IntelligenceFrom Natural Language Processing to Artificial Intelligence
From Natural Language Processing to Artificial Intelligence
 
Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?
 
Data Day Seattle, From NLP to AI
Data Day Seattle, From NLP to AIData Day Seattle, From NLP to AI
Data Day Seattle, From NLP to AI
 

Recently uploaded

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Recently uploaded (20)

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Indexing vector spaces graph search engines

  • 1. Indexing Vector Spaces Graphs Search Engines Piero Molino XY Lab
  • 2. Relevance for New Publishing Search Engines can be considered publishers They select, filter, choose and sometimes create the content they give as result to queries From the user point of view, the search process is a blackbox, like magic Understanding it increases the awareness of the risks it has and could help bypassing or hacking it
  • 3. Information Retrieval Let a collection of unstructured information (set of texts) Let an information need (query) Objective: Find items of the collection that answers the information need
  • 4. Representation Simple/naive assumption A text can be represented by the words it contains Bag-of-words model
  • 5. Bag of Words "Me and John and Mary attend this lab" ! Me:1
  • 6. Bag of Words "Me and John and Mary attend this lab" ! Me:1 and:1
  • 7. Bag of Words "Me and John and Mary attend this lab" ! Me:1 and:1 John:1
  • 8. Bag of Words "Me and John and Mary attend this lab" ! Me:1 and:2 John:1
  • 9. Bag of Words "Me and John and Mary attend this lab" ! Me:1 and:2 John:1 Mary:1
  • 10. Bag of Words "Me and John and Mary attend this lab" ! Me:1 and:2 John:1 Mary:1 attend:1
  • 11. Bag of Words "Me and John and Mary attend this lab" ! Me:1 and:2 John:1 Mary:1 attend:1 this:1
  • 12. Bag of Words "Me and John and Mary attend this lab" ! Me:1 and:2 John:1 Mary:1 attend:1 this:1 lab:1
  • 13. Inverted index To access the documents, we build an index, just like humans do Inverted Index: word-to-document "Indice Analitico"
  • 14. Example doc1:"I like football" doc2:"John likes football" doc3:"John likes basketball" I → doc1 like → doc1 football → doc1 doc2 John → doc2 doc3 likes → doc2 doc3 basketball → doc3
  • 15. Query Information need represented with the same bag-of-words model Who likes basketball? → Who:1 likes:1 basketball:1
  • 16. Retrieval Who:1 likes:1 basketball:1 I → doc1 like → doc1 football → doc1 doc2 John → doc2 doc3 likes → doc2 doc3 basketball → doc3 Result: [doc2, doc3]! (without any order... or no?)
  • 17. Boolean Model likes → doc2 doc3 basketball → doc3 doc3 contains both "likes" and "basketball", 2/3 of the input query terms, doc2 contains only 1/3 Boolean Model: represents only presence or absence of query words and ranks document according to how many query words they contain Result: [doc3, doc2]! (with this precise order)
  • 18. Vector Space Model Represents documents and words as vectors or points in a (multidimensional) geometrical space We build a term x documents matrix doc1 doc2 doc3 I 1 0 0 like 1 0 0 football 1 1 0 John 0 1 1 likes 0 1 1 basketball 0 0 1
  • 19. Graphical Representation likes 1 football 1 0 doc1 doc2 doc3
  • 20. Tf-idf The cells of the matrix contain the frequency of a word in a document (term frequency tf) This value can be counterbalanced by the number of documents that contain the word (inverse document frequency idf) The final weight is called tf-idf
  • 21. Tf-idf Example 1 "Romeo" is found 700 times in the document "Romeo and Juliet" and also in 7 other plays The tf-idf of "Romeo" in the document "Romeo and Juliet" is 7000x(1/7)=100
  • 22. Tf-idf Example 2 "and" is found 1000 times in the document "Romeo and Juliet" and also in 10000 other plays The tf-idf of "and" in the document "Romeo and Juliet" is 1000x(1/10000)=0.1
  • 23. Similarity Queries are like an additional column in matrix We can compute a similarity between document and queries
  • 24. Similarity 0 doc1 doc2 doc3 query Euclidean Distance
  • 25. Similarity 0 doc1 doc2 doc3 query Vector length normalization Projection of one normalized vector over another = cos(θ) θ
  • 26. Cosine Similarity The cosine of the angle between two vectors is a measure of how similar two vectors are As the vectors represents documents and queries, the cosine is a measure of similarity of how similar is a document with respect to the query
  • 27. Ranking with Cosine Similarity Shows how query and documents are correlated 1 = maximum correlation, 0 = no correlation, -1 = negative correlation The documents obtained from the inverted index are ranked according to their cosine similarity wrt. the query
  • 28. Ranking on the web In the web documents (webpages) are connected to each other by hyperlinks Is there a way to exploit this topology? PageRank algorithm (and several others)
  • 29. The web as a graph 2 6 3 7 5 4 1 Webpage Hyperlink
  • 30. Matrix Representation 1 2 3 4 5 6 7 1 1/2 1/2 2 1/2 1/2 3 1 4 1/3 1/3 1/3 5 1/2 1/2 6 1/2 1/2 7 1
  • 31. Simulating Random Walks e = random vector, can be interpreted as how important is a node in the network Es. e = 1:0.5, 2:0.5, 3:0.5, 4:0.5, ... A = the matrix e = e · A, repeatedly is like simulating walking randomly on the graph The process converges to a vector e where each value represents the importance in the network
  • 32. Random Walks 2 6 3 7 5 4 1 Active Webpage Hyperlink
  • 33. Random Walks 2 6 3 7 5 4 1 Active Webpage Hyperlink
  • 34. Random Walks 2 6 3 7 5 4 1 Active Webpage Hyperlink
  • 35. Random Walks 2 6 3 7 5 4 1 Active Webpage Hyperlink
  • 36. Random Walks 2 6 3 7 5 4 1 Active Webpage Hyperlink
  • 37. Importance - Authority Webpage Hyperlink 2 6 3 7 5 4 1 Importance (value in e)
  • 38. Ranking with PageRank I → doc1 like → doc1 football → doc1 doc2 John → doc2 doc3 likes → doc2 doc3 basketball → doc3 e doc1 → 0.3 doc2 → 0.1 doc3 → 0.5 The inverted index gives doc2 and doc3 Using the importance in e for ranking Result: [doc3, doc2] (ordered list)
  • 39. Real World Ranking Real word search engines exploit Vector-Space- Model-like approaches, PageRank-like approaches and several others They balance all the different factors observing what webpage you click on after issuing a query and using them as examples for a machine learning algorithm
  • 40. Machine Learning A machine learning algorithm works like a baby You give him examples of what is a dog and what is a cat He learns to look at the pointy ears, the nose and other aspects of the animals When he sees a new animal he says "cat" or "dog" depending on his experience
  • 41. Learning to Rank The different indicators of how good a webpage for a query is (cosine similarity, PageRank, etc.) are like the nose, the tail and the ears of the animal Instead of cat and dog, the algorithm classifies relevant or non-relevant! When you issue a query and click on a result it is marked as relevant, otherwise it is considered non-relevant
  • 42. Issues Among the different indicators there are a lot of contextual ones Where you issue the query from, who you are, what you visited before, what result you clicked on before, ... This makes the result for each person, in each place, in different moments in time, different
  • 43. The filter bubble A lot of diverse content is filtered from you The search engine shows you what it thinks you will be interested on, based on all this contextual factors Lack of transparency of the search process A way out of the bubble? Maybe favoring serendipity
  • 44. Thank you for your attention