
Information retrieval: Creating a Search Engine


The tf-idf model for information retrieval along with basics of spelling correction.

An explanation is given for each slide.



  1. Information Retrieval: Creating a Search Engine
  2. Agenda: Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF-IDF Score • Activity
  3. Agenda: Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF-IDF Score • Activity
  4. Introduction: "Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections." - C. Manning, P. Raghavan, H. Schütze
  5. Agenda: Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF-IDF Score • Activity
  6. Basic Text Processing. Word tokenization (dividing a sentence into words): • Hey, where have you been last week? • I flew from New York to Illinois and finally landed in Jammu & Kashmir. • Long journey, huh?
  7. Issues in Tokenization: • New York → one token or two? • Jammu & Kashmir → one token or three? • Huh, Hmmm, Uh → ?? • India's Economy → India? Indias? India's? • Won't, isn't → Will not? Is not? Isn't? • Mother-in-law → Mother in law? • Ph.D. → PhD? Ph D? Ph.D.?
  8. Language Issues in Tokenization: German noun compounds are not segmented. • Lebensversicherungsgesellschaftsangestellter • 'life insurance company employee' (example from Foundations of Statistical Natural Language Processing; C. Manning, Hinrich Schütze)
  9. Language Issues in Tokenization: • There are no spaces between words in Chinese and Japanese. • In Japanese, multiple alphabets are intermingled. • Arabic (and Hebrew) is basically written right to left, but certain items such as numbers are written left to right.
  10. Regular Expressions: regular expressions are a way to represent patterns in text.
  11. Regular Expressions: regular expressions are a way to represent patterns in text.
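As a minimal sketch of the idea (my own, not from the deck), one regex can keep tokens like "Ph.D.", "India's", and "mother-in-law" whole; the pattern and the function name tokenize are illustrative choices, and multi-word names such as "New York" would still need separate handling.

    import re

    # Alphabetic tokens that may contain internal periods, apostrophes,
    # or hyphens (Ph.D., India's, mother-in-law), with an optional
    # trailing period for abbreviations.
    TOKEN_RE = re.compile(r"[A-Za-z]+(?:[.'\-][A-Za-z]+)*\.?")

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("I flew from New York to Jammu & Kashmir for my Ph.D."))
    # ['I', 'flew', 'from', 'New', 'York', 'to', 'Jammu', 'Kashmir',
    #  'for', 'my', 'Ph.D.']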
  12. Agenda: Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF-IDF Score • Activity
  13. Basic Search Model (flow diagram): the user has an information need → formulates a query → the system returns results → the user refines the query and repeats.
  14. IR System Evaluation: • Precision: what fraction of the retrieved documents are relevant to the user's information need? • Recall: what fraction of the relevant documents in the collection are retrieved? • F score: the harmonic mean of precision and recall.
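As a quick illustration (my own sketch, not part of the deck), all three measures follow directly from sets of retrieved and relevant document ids; the function name evaluate is arbitrary.

    def evaluate(retrieved, relevant):
        # retrieved, relevant: sets of document ids
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return precision, recall, f_score

    # 4 documents retrieved, 2 of them among the 3 relevant ones:
    print(evaluate({1, 2, 3, 4}, {2, 4, 5}))  # ≈ (0.5, 0.667, 0.571)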
  15. Ranked IR: • How do we rank the retrieved documents? • We assign a score to each document. • The score should measure how well the document matches the query.
  16. Agenda: Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF-IDF Score • Activity
  17. • We need a way to assign a score to each query-document pair. • The more frequent the query term is in the document, the higher the score should be. • If a query term does not occur in the document, its score should be 0.
  18. Term Frequency: raw count of each term in each book (HGG = Hitchhiker's Guide to Galaxy, LCS = Last Chance to See, LUE = Life, Universe & Everything, REU = Restaurant at End of Universe, SLF = So Long & Thanks for all the Fish, ST = Starship Titanic)

                     HGG    LCS    LUE    REU    SLF     ST
        galaxy        62      0     51     49     24     29
        zaphod       214      0     88    405      2      0
        ship          59      2     85    126     27    119
        arthur       347      0    376    236    313      0
        fiordland      0      9      0      0      0      0
        santorini      0      0      3      0      0      0
        wordlings      0      0      0      0      1      0
  19. Term Frequency: • How do we use tf to compute the query-document match score? • A document with 85 occurrences of a term is more relevant than a document with one occurrence. • But NOT 85 times more relevant! • Relevance does not increase linearly with frequency, so raw term frequency will not help.
  20. Log-tf Weighting: the log term frequency weight of term t in document d is
          wt,d = 1 + log10(tft,d)   if tft,d > 0
          wt,d = 0                  otherwise
  21. tf Score: S(q,d) = ∑t in q∩d (1 + log10 tft,d)
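A minimal sketch of these two formulas in Python (mine, not the deck's starter code); doc_counts is an assumed dict mapping each term to its raw count in one document, so absent terms contribute 0 as the slides require.

    from math import log10

    def log_tf(tf):
        # w(t,d) = 1 + log10(tf) if tf > 0, else 0
        return 1 + log10(tf) if tf > 0 else 0.0

    def tf_score(query_terms, doc_counts):
        # Sum log-tf weights over terms shared by query and document
        return sum(log_tf(doc_counts.get(t, 0)) for t in query_terms)

    hitchhikers = {"galaxy": 62, "zaphod": 214, "ship": 59, "arthur": 347}
    print(tf_score(["arthur", "fiordland"], hitchhikers))  # ≈ 3.54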
  22. Term Frequency: raw term counts (the table from slide 18, repeated).
  23. Term Frequency Weight: wt,d = 1 + log10(tft,d) for tft,d > 0 (books abbreviated as in slide 18)

                     HGG    LCS    LUE    REU    SLF     ST
        galaxy      2.79      0   2.71   2.69   2.38   2.46
        zaphod      3.33      0   2.94   3.61   1.30      0
        ship        2.77   1.30   2.93   3.10   2.43   3.08
        arthur      3.54      0   3.58   3.37   3.50      0
        fiordland      0   1.95      0      0      0      0
        santorini      0      0   1.47      0      0      0
        wordlings      0      0      0      0   1.00      0
  24. Term Frequency: • Problem: all terms are considered equally important. • Certain terms are of no use when determining relevance.
  25. Term Frequency: • Rare terms are more informative than frequent terms → e.g. the query "information retrieval". • Frequent terms are less informative than rare terms (e.g. man, house, cat). • A document containing a frequent query term is likely to be relevant, but relevance is not guaranteed. • Frequent terms should get positive weights, but lower ones than rare terms.
  26. Agenda: Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF-IDF Score • Activity
  27. Document Frequency: • We use document frequency (df). • dft is the number of documents in the collection that contain the term t. • The lower the dft, the rarer the term and the higher its importance.
  28. Inverse Document Frequency: • We take the inverse document frequency (idf) of t: idft = log10(N/dft) • N is the number of documents in the collection.
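A short sketch of the idf computation (my own; docs is assumed to be a list of per-document token sets). Terms that appear in no document are given idf 0 here by convention, since log10(N/0) is undefined.

    from math import log10

    def idf(term, docs):
        # df(t): number of documents that contain the term
        df = sum(1 for doc in docs if term in doc)
        return log10(len(docs) / df) if df else 0.0

    # With N = 6 books and df = 4, idf = log10(6/4):
    docs = [{"zaphod"}, set(), {"zaphod"}, {"zaphod"}, {"zaphod"}, set()]
    print(round(idf("zaphod", docs), 3))  # 0.176, as in slide 30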
  29. idf Example: idft = log10(N/dft), with N = 10^7

            df           idf
            1             7
            10            6
            100           5
            1,000         4
            10,000        3
            100,000       2
            1,000,000     1
            10,000,000    0

      There is a single idf value per term for the whole collection; it does not depend on any particular document.
  30. idf Weights: idft = log10(N/dft), here N = 6

            Term        df    idf
            galaxy      5     0.079
            zaphod      4     0.176
            ship        6     0
            arthur      4     0.176
            fiordland   1     0.778
            santorini   1     0.778
            wordlings   1     0.778
  31. Agenda: Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF-IDF Score • Activity
  32. tf-idf Weighting: • The tf-idf weight is simply the product of the tf weight and the idf weight: Wt,d = (1 + log10 tft,d) × log10(N/dft) • It increases with the number of occurrences within a document. • It increases with the rarity of the term in the collection.
  33. tf-idf Score: the final ranking of document d for query q is given by Score(q,d) = ∑t in q∩d Wt,d
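Putting the pieces together, here is a hedged sketch of the full scoring function (mine, not the deck's starter code); doc_counts and df are assumed dicts of per-document term counts and collection-wide document frequencies.

    from math import log10

    def tf_idf_score(query_terms, doc_counts, df, n_docs):
        # Score(q,d) = sum over t in q∩d of (1 + log10 tf) * log10(N/df)
        score = 0.0
        for t in query_terms:
            tf = doc_counts.get(t, 0)
            if tf > 0 and df.get(t, 0) > 0:
                score += (1 + log10(tf)) * log10(n_docs / df[t])
        return score

    hitchhikers = {"galaxy": 62, "zaphod": 214}
    df = {"galaxy": 5, "zaphod": 4}
    print(tf_idf_score(["zaphod", "galaxy"], hitchhikers, df, 6))
    # ≈ 0.5861 + 0.2204 = 0.8065, matching the entries in slide 36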
  34. Term Frequency Weight (the table from slide 23, repeated).
  35. idf Weights (the table from slide 30, repeated).
  36. tf-idf Weight: tf weight × idf weight (books abbreviated as in slide 18)

                      HGG     LCS     LUE     REU     SLF      ST
        galaxy     0.2204       0  0.2140  0.2125  0.1880  0.1943
        zaphod     0.5861       0  0.5174  0.6354  0.2288       0
        ship            0       0       0       0       0       0
        arthur     0.6230       0  0.6301  0.5931  0.6160       0
        fiordland       0  1.5171       0       0       0       0
        santorini       0       0  1.1437       0       0       0
        wordlings       0       0       0       0  0.7780       0

      Note that ship scores 0 everywhere: it occurs in all six books, so its idf is 0.
  37. Agenda: Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF-IDF Score • Activity: Spelling Correction
  38. Using a Dictionary: • How do we tell whether a word's spelling is correct? • Maybe use a dictionary, like the Oxford dictionary? • Then what about terms like "Accenture", "Brangelina", or "Paani da Rang"? • No standard dictionary contains such terms, so they would be flagged as spelling errors. But this should not be the case!
  39. Using a Corpus: • So we use a collection of documents instead. • This collection serves as the basis for spelling correction. • To correct a spelling, we find the word in the collection that is nearest to the misspelled word, and replace the misspelled word with it.
  40. Minimum Edit Distance: the minimum number of edit operations • Insertion (add a letter) • Deletion (remove a letter) • Substitution (change one letter to another) • Transposition (swap adjacent letters) needed to transform one word into the other.
  41. Minimum Edit Distance (Example): aligning BIOGRAPHY against AUTOGRAPH

            *  B  I  O  G  R  A  P  H  Y
            A  U  T  O  G  R  A  P  H  *
            i  s  s                    d

      • Let the cost of each operation be 1 → with one insertion, two substitutions, and one deletion, the total edit distance between these words is 4.
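The slides don't show an implementation, but the standard dynamic-programming solution (here the Damerau variant, which also counts adjacent transpositions as one operation) looks roughly like this:

    def edit_distance(a, b):
        # dp[i][j] = distance between the prefixes a[:i] and b[:j].
        # Insertion, deletion, and substitution cost 1 each; an
        # adjacent transposition (Damerau) also costs 1.
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # transposition
        return dp[m][n]

    print(edit_distance("AUTOGRAPH", "BIOGRAPHY"))  # 4, as on slide 41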
  42. Spelling Correction: • For a given word, find all words at edit distance 1 and 2. • Which of these is the most likely correction for the given word? • The one that occurs most frequently in the corpus. That's the answer!
  43. Minimum Edit Distance: • Finding all strings at edit distance 1 already yields a large collection. • For a word of length n: insertions 26(n + 1), deletions n, substitutions 26n, transpositions n - 1, for a TOTAL of 54n + 25. • A few of these may be duplicates. • The number of strings at edit distance 2 is obviously far larger than 54n + 25.
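These counts match the candidate generation in Peter Norvig's well-known spelling corrector (the deck's corpus is his); a sketch along those lines, assuming lowercase words over a 26-letter alphabet:

    import string

    def edits1(word):
        # All strings at edit distance 1: 54n + 25 before deduplication
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(word):
        # Everything at edit distance 2 (can be very large)
        return {e2 for e1 in edits1(word) for e2 in edits1(e1)}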
  44. Complete Algorithm: 1. Calculate the frequency of each word in the corpus. 2. Read the input word to be corrected. 3. If the input word is present in the corpus, return it. 4. Otherwise, find all words at edit distance 1 and 2. 5. Among these, return the word that occurs most frequently in the corpus. 6. If none of these words occurs in the corpus, return the original word.
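A direct translation of steps 1-6 (my sketch; counts is assumed to be a collections.Counter built from the corpus in step 1, and edits1/edits2 are as sketched above):

    from collections import Counter

    def correct(word, counts):
        if word in counts:                                          # step 3
            return word
        candidates = (edits1(word) | edits2(word)) & counts.keys()  # step 4
        if candidates:                                              # step 5
            return max(candidates, key=counts.get)
        return word                                                 # step 6

One easy modification when pushing the accuracy up: Norvig's own corrector prefers distance-1 candidates and falls back to distance-2 ones only when none exist, rather than pooling both as the steps above do.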
  45. Evaluation: Accuracy = (number of words successfully corrected) / (number of input words)
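Using correct() from above, the evaluation is one loop (test_pairs is an assumed list of (misspelling, intended word) pairs from the test set):

    def accuracy(test_pairs, counts):
        right = sum(correct(wrong, counts) == intended
                    for wrong, intended in test_pairs)
        return right / len(test_pairs)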
  46. You're Given: • A collection of documents (corpus) put together by Peter Norvig: public-domain books from Project Gutenberg, a list of the most frequent words from Wiktionary, and the British National Corpus. • Starter code in Java, Python, and C#, containing code for reading the corpus and calculating accuracy. • A test set from Roger Mitton's Birkbeck spelling error corpus (slightly modified) to test your algorithm.
  47. TODO: • Implement the algorithm in Java, Python, or C#. A successful implementation will reach an accuracy of about 31.5%. • Modify the given algorithm to raise the accuracy to 50%. • Going beyond 50% on the given test set is a challenging task.
  48. Further Reading (Information Retrieval):
      1. http://nlp.stanford.edu/fsnlp/
      2. http://nlp.stanford.edu/IR-book/
      3. https://class.coursera.org/nlp/class/index
      4. http://en.wikipedia.org/wiki/Tf*idf
      5. http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt
      6. http://research.google.com/pubs/InformationRetrievalandtheWeb.html
      7. http://norvig.com/
      8. http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html
      9. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-864-advanced-natural-language-processing-fall-2005/index.htm
      10. http://www.slideshare.net/butest/search-engines-3859807
  49. Further Reading (Spelling Correction):
      1. http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf
      2. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.1392
      3. http://www.stanford.edu/class/cs276/handouts/spelling.pptx
      4. http://alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html
      5. http://portal.acm.org/citation.cfm?id=146380
      6. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.9400
      7. http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36180.pdf
      8. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf
