Indexing Vector Spaces
Graphs Search Engines

Piero Molino
XY Lab

Relevance for New Publishing

Search engines can be considered publishers: they select, filter, choose and sometimes create the content they return as results to queries.
From the user's point of view, the search process is a black box, like magic.
Understanding it increases awareness of its risks and could help in bypassing or hacking it.

Information Retrieval

Given: a collection of unstructured information (a set of texts)
Given: an information need (a query)
Objective: find the items of the collection that answer the information need

Representation

Simple/naive assumption: a text can be represented by the words it contains
Bag-of-words model

Bag of Words

"Me and John and Mary attend this lab"

Scanning the sentence word by word and counting occurrences:

Me:1
Me:1 and:1
Me:1 and:1 John:1
Me:1 and:2 John:1
Me:1 and:2 John:1 Mary:1
Me:1 and:2 John:1 Mary:1 attend:1
Me:1 and:2 John:1 Mary:1 attend:1 this:1
Me:1 and:2 John:1 Mary:1 attend:1 this:1 lab:1

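As a minimal sketch of this construction (assuming Python and naive whitespace tokenization), a bag of words is just a word count:

    from collections import Counter

    def bag_of_words(text):
        """Map each word of a text to the number of times it occurs."""
        return Counter(text.split())  # naive whitespace tokenization

    bag = bag_of_words("Me and John and Mary attend this lab")
    print(bag)  # Counter({'and': 2, 'Me': 1, 'John': 1, 'Mary': 1, ...})
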
Inverted index

To access the documents, we build an index, just like humans do
Inverted index: a word-to-document mapping, like the subject index ("indice analitico") at the back of a book

Example

doc1: "I like football"
doc2: "John likes football"
doc3: "John likes basketball"

I → doc1
like → doc1
football → doc1 doc2
John → doc2 doc3
likes → doc2 doc3
basketball → doc3

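A minimal sketch of building this inverted index in Python (same naive tokenization as above):

    from collections import defaultdict

    docs = {
        "doc1": "I like football",
        "doc2": "John likes football",
        "doc3": "John likes basketball",
    }

    # Inverted index: word -> set of documents containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    print(sorted(index["likes"]))  # ['doc2', 'doc3']
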
Query

The information need is represented with the same bag-of-words model
Who likes basketball? → Who:1 likes:1 basketball:1

Retrieval

Who:1 likes:1 basketball:1

I → doc1
like → doc1
football → doc1 doc2
John → doc2 doc3
likes → doc2 doc3
basketball → doc3

Result: [doc2, doc3] (in no particular order... or is there one?)

Boolean Model

likes → doc2 doc3
basketball → doc3

doc3 contains both "likes" and "basketball", 2 of the 3 query terms; doc2 contains only 1 of 3
Boolean model: represents only the presence or absence of query words and ranks documents according to how many query words they contain
Result: [doc3, doc2] (in this precise order)

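A sketch of this Boolean-style ranking (the index dict mirrors the previous sketch):

    from collections import Counter

    # Inverted index from the previous sketch
    index = {
        "I": {"doc1"}, "like": {"doc1"},
        "football": {"doc1", "doc2"},
        "John": {"doc2", "doc3"}, "likes": {"doc2", "doc3"},
        "basketball": {"doc3"},
    }

    def boolean_rank(query, index):
        """Rank documents by how many query terms they contain."""
        matches = Counter()
        for term in query.split():
            for doc_id in index.get(term, ()):
                matches[doc_id] += 1
        return [doc for doc, _ in matches.most_common()]

    print(boolean_rank("Who likes basketball", index))  # ['doc3', 'doc2']
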
Vector Space Model

Represents documents and words as vectors or points in a (multidimensional) geometrical space
We build a term x document matrix:

            doc1  doc2  doc3
I             1     0     0
like          1     0     0
football      1     1     0
John          0     1     1
likes         0     1     1
basketball    0     0     1

Graphical Representation

[Figure: doc1, doc2 and doc3 plotted as points in a two-dimensional space whose axes are the terms "football" and "likes"]

Tf-idf

The cells of the matrix contain the frequency of a word in a document (term frequency, tf)
This value can be counterbalanced by the number of documents that contain the word (inverse document frequency, idf)
The final weight is called tf-idf

Tf-idf Example 1

"Romeo" is found 700 times in the document "Romeo and Juliet" and also in 7 other plays
The tf-idf of "Romeo" in "Romeo and Juliet" is 700 × (1/7) = 100

Tf-idf Example 2

"and" is found 1000 times in the document "Romeo and Juliet" and also in 10000 other plays
The tf-idf of "and" in "Romeo and Juliet" is 1000 × (1/10000) = 0.1

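A minimal sketch of the simplified weighting used in these two examples (the slides use tf × 1/df; real systems usually take a log-scaled idf):

    def tf_idf(tf, docs_with_term):
        """Simplified tf-idf as in the slides: tf x (1 / document frequency)."""
        return tf * (1 / docs_with_term)

    print(tf_idf(700, 7))       # 100.0  -> "Romeo" in "Romeo and Juliet"
    print(tf_idf(1000, 10000))  # 0.1    -> "and" in "Romeo and Juliet"
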
Similarity

Queries are like an additional column in the matrix
We can compute a similarity between documents and queries

Similarity

[Figure: doc1, doc2, doc3 and the query as points in the vector space, compared by Euclidean distance]

Similarity

[Figure: after vector length normalization, the projection of one normalized vector onto another equals cos(θ), the cosine of the angle between them]

Cosine Similarity

The cosine of the angle between two vectors is a measure of how similar the two vectors are
As the vectors represent documents and queries, the cosine measures how similar a document is to the query

Ranking with Cosine Similarity

Shows how query and documents are correlated
1 = maximum correlation, 0 = no correlation, -1 = negative correlation
The documents obtained from the inverted index are ranked according to their cosine similarity with respect to the query

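A sketch of cosine similarity over bag-of-words vectors (using raw term counts here; tf-idf weights would slot in the same way):

    import math
    from collections import Counter

    def cosine_similarity(bag_a, bag_b):
        """Cosine of the angle between two bag-of-words vectors."""
        dot = sum(bag_a[w] * bag_b[w] for w in bag_a.keys() & bag_b.keys())
        norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
        norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = Counter("Who likes basketball".split())
    doc3 = Counter("John likes basketball".split())
    print(cosine_similarity(query, doc3))  # ≈ 0.67 (2 shared terms out of 3)
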
Ranking on the web

On the web, documents (webpages) are connected to each other by hyperlinks
Is there a way to exploit this topology?
PageRank algorithm (and several others)

The web as a graph

[Figure: seven webpages (nodes 1-7) connected by hyperlinks (directed edges)]

Matrix Representation

Each row i of the 7 × 7 matrix holds the transition probabilities out of page i, spreading 1/outdegree(i) over the pages it links to:

row 1: 1/2, 1/2      (2 outlinks)
row 2: 1/2, 1/2      (2 outlinks)
row 3: 1             (1 outlink)
row 4: 1/3, 1/3, 1/3 (3 outlinks)
row 5: 1/2, 1/2      (2 outlinks)
row 6: 1/2, 1/2      (2 outlinks)
row 7: 1             (1 outlink)

Simulating Random Walks

e = a vector that can be interpreted as how important each node in the network is
E.g. e = 1:0.5, 2:0.5, 3:0.5, 4:0.5, ...
A = the matrix
Computing e = e · A repeatedly is like simulating a random walk on the graph
The process converges to a vector e where each value represents the importance of a node in the network

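A minimal power-iteration sketch of this process (hypothetical toy graph; real PageRank also adds a damping factor to guarantee convergence):

    def power_iteration(links, num_steps=50):
        """Repeatedly apply e = e . A over a link graph.

        links maps each page to the pages it links to; each page spreads
        its current score evenly over its outlinks.
        """
        pages = list(links)
        e = {p: 1.0 / len(pages) for p in pages}  # uniform starting vector
        for _ in range(num_steps):
            new_e = dict.fromkeys(pages, 0.0)
            for page, outlinks in links.items():
                share = e[page] / len(outlinks)
                for target in outlinks:
                    new_e[target] += share
            e = new_e
        return e

    links = {1: [2, 3], 2: [1, 3], 3: [1]}  # hypothetical, not the slides' graph
    print(power_iteration(links))  # importance scores, summing to 1
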
Random Walks

[Figures, five slides: the random walk animated step by step on the graph, with the currently active webpage highlighted at each step]

Importance - Authority

[Figure: the same graph with each node drawn proportionally to its importance (its value in e)]

Ranking with PageRank

I → doc1
like → doc1
football → doc1 doc2
John → doc2 doc3
likes → doc2 doc3
basketball → doc3

e: doc1 → 0.3, doc2 → 0.1, doc3 → 0.5

The inverted index gives doc2 and doc3
Using the importance values in e for ranking:
Result: [doc3, doc2] (ordered list)

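A sketch of this combination, reusing the index from the Boolean Model sketch (retrieve with the inverted index, then order by precomputed importance):

    pagerank = {"doc1": 0.3, "doc2": 0.1, "doc3": 0.5}

    def rank_with_pagerank(query, index, pagerank):
        """Retrieve matching documents, then sort them by importance in e."""
        candidates = set()
        for term in query.split():
            candidates |= set(index.get(term, ()))
        return sorted(candidates, key=lambda doc: pagerank[doc], reverse=True)

    print(rank_with_pagerank("Who likes basketball", index, pagerank))
    # ['doc3', 'doc2']
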
Real World Ranking

Real-world search engines exploit Vector-Space-Model-like approaches, PageRank-like approaches and several others
They balance all the different factors by observing which webpages you click on after issuing a query, and using those clicks as examples for a machine learning algorithm

Machine Learning

A machine learning algorithm works like a baby
You give it examples of what a dog is and what a cat is
It learns to look at the pointy ears, the nose and other aspects of the animals
When it sees a new animal, it says "cat" or "dog" depending on its experience

Learning to Rank

The different indicators of how good a webpage is for a query (cosine similarity, PageRank, etc.) are like the nose, the tail and the ears of the animal
Instead of cat and dog, the algorithm classifies results as relevant or non-relevant
When you issue a query and click on a result, that result is marked as relevant; otherwise it is considered non-relevant

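A toy sketch of the idea (hypothetical numbers; a tiny logistic model trained by gradient descent on click labels, standing in for the far richer learning-to-rank systems real engines use):

    import math

    # One row per (query, webpage) pair.
    # Features: [cosine similarity, PageRank]; label: 1 if clicked, else 0.
    examples = [([0.9, 0.5], 1), ([0.8, 0.1], 0), ([0.2, 0.3], 0), ([0.7, 0.6], 1)]

    weights, bias, lr = [0.0, 0.0], 0.0, 0.5

    def score(features):
        """Logistic score: estimated probability that the page is relevant."""
        z = bias + sum(w * f for w, f in zip(weights, features))
        return 1 / (1 + math.exp(-z))

    # Gradient-descent training on the click data
    for _ in range(1000):
        for features, label in examples:
            error = score(features) - label
            for i, f in enumerate(features):
                weights[i] -= lr * error * f
            bias -= lr * error

    print(score([0.85, 0.4]))  # high score: the model predicts "relevant"
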
Issues

Among the different indicators there are a lot of contextual ones:
where you issue the query from, who you are, what you visited before, which results you clicked on before, ...
This makes the results different for each person, in each place, at each moment in time

The filter bubble

A lot of diverse content is filtered out for you
The search engine shows you what it thinks you will be interested in, based on all these contextual factors
Lack of transparency in the search process
A way out of the bubble? Maybe favoring serendipity

Thank you for your attention