Information Retrieval:
   Creating a Search
        Engine
Agenda
•   Introduction
•   Basic Text Processing
•   IR Basics
•   Term Frequency Weighting
•   Inverse Document Frequency Weighting
•   TF – IDF Score
•   Activity
Agenda
•   Introduction
•   Basic Text Processing
•   IR Basics
•   Term Frequency Weighting
•   Inverse Document Frequency Weighting
•   TF – IDF Score
•   Activity
Introduction
Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text)
that satisfies an information need from within large
collections.

                     - C. Manning, P. Raghavan, H. Schütze
Agenda
•   Introduction
•   Basic Text Processing
•   IR Basics
•   Term Frequency Weighting
•   Inverse Document Frequency Weighting
•   TF – IDF Score
•   Activity
Basic Text Processing
Word Tokenization:
(Dividing a sentence into words.)


• Hey, where have you been last week?
• I flew from New York to Illinois and finally
  landed in Jammu & Kashmir.
• Long Journey, huh?
Issues in Tokenization
• New York → One Token or Two?
• Jammu & Kashmir → One Token or Three?
• Huh, Hmmm, Uh → ??
• India’s Economy → India? Indias? India’s?
• Won’t, isn’t → Will not? Is not? Isn’t?
• Mother-in-law → Mother in Law?
• Ph.D. → PhD? Ph D? Ph.D.?
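Each of these choices can be prototyped with a small regular-expression tokenizer. The sketch below (a minimal Python example; the pattern is an illustrative choice, not a definitive rule set) keeps apostrophes, internal periods, and hyphens inside tokens, so "India's", "Ph.D", and "mother-in-law" survive as single tokens, while multiword names like "New York" still split in two:

```python
import re

def tokenize(sentence):
    # A word is a run of word characters, optionally continued by
    # apostrophe/period/hyphen + more word characters. This keeps
    # "India's", "mother-in-law", and "Ph.D" whole, but drops a
    # sentence-final period ("Illinois." -> "Illinois").
    return re.findall(r"\w+(?:['.\-]\w+)*", sentence)

print(tokenize("Hey, where have you been last week?"))
# ['Hey', 'where', 'have', 'you', 'been', 'last', 'week']
print(tokenize("I flew from New York to Illinois."))
# Note: "New" and "York" come out as two tokens; merging multiword
# names needs a gazetteer or a statistical model, not just a regex.
```

This also shows why tokenization is a modeling decision: the same regex cannot simultaneously keep "Ph.D." with its final period and strip sentence-final periods.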
Language Issues in Tokenization


German noun compounds are not segmented:

• Lebensversicherungsgesellschaftsangestellter
• ‘life insurance company employee’


       Example from Foundations of Statistical Natural Language Processing; C. Manning, Hinrich Schütze
Language Issues in Tokenization

• There is no space between words in Chinese
  and Japanese.
• In Japanese, multiple alphabets are
  intermingled.
• Arabic (or Hebrew) is basically written right to
  left, but certain items like numbers are written
  left to right.
Regular Expressions
• Regular Expressions are a way to represent
  patterns in text.
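As a minimal illustration (the text and patterns below are invented for this example), Python's `re` module can express such patterns directly:

```python
import re

text = "Arthur Dent paid $4.20 for tea on 2024-05-01."

# Any run of digits:
print(re.findall(r"\d+", text))                       # ['4', '20', '2024', '05', '01']
# An ISO-style date pattern:
print(re.search(r"\d{4}-\d{2}-\d{2}", text).group())  # 2024-05-01
# Case-insensitive whole-word match:
print(bool(re.search(r"\barthur\b", text, re.IGNORECASE)))  # True
```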
Agenda
•   Introduction
•   Basic Text Processing
•   IR Basics
•   Term Frequency Weighting
•   Inverse Document Frequency Weighting
•   TF – IDF Score
•   Activity
Basic Search Model

User needs info → Query → Results → Query refining → (back to Query)
IR System Evaluation
• Precision: how many of the retrieved documents are
  relevant to the user’s information need?
• Recall: how many of the relevant documents in the
  collection are retrieved?
• F Score: the harmonic mean of precision and recall
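These three measures are easy to compute once we have the retrieved set and the relevance judgments. A small sketch with hypothetical document IDs:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant)

def f_score(retrieved, relevant):
    """Harmonic mean of precision and recall."""
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

retrieved = ["d1", "d2", "d3", "d4"]   # hypothetical result list
relevant = ["d1", "d3", "d5"]          # hypothetical judged-relevant set
print(precision(retrieved, relevant))          # 0.5   (2 of 4 retrieved are relevant)
print(round(recall(retrieved, relevant), 3))   # 0.667 (2 of 3 relevant were retrieved)
print(round(f_score(retrieved, relevant), 3))  # 0.571
```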
Ranked IR
• How to rank the retrieved documents?

• We assign a score to each document.
• This score should measure how well the document
  matches the query.
Agenda
•   Introduction
•   Basic Text Processing
•   IR Basics
•   Term Frequency Weighting
•   Inverse Document Frequency Weighting
•   TF – IDF Score
•   Activity
• We need a way to assign a score to each “query
  – document” pair.
• The more frequent the query term in the
  document, the higher the score should be.
• If a query term doesn’t occur in the document,
  its score should be 0.
Term Frequency
Term        Hitchhiker’s   Last Chance   Life, Universe   Restaurant at    So Long & Thanks   Starship
            Guide to       to See        & Everything     End of           for all the Fish   Titanic
            Galaxy                                        Universe

galaxy      62             0             51               49               24                 29
zaphod      214            0             88               405              2                  0
ship        59             2             85               126              27                 119
arthur      347            0             376              236              313                0
fiordland   0              9             0                0                0                  0
santorini   0              0             3                0                0                  0
wordlings   0              0             0                0                1                  0
Term Frequency
• How do we use tf to compute the query-document
  match score?

• A document with 85 occurrences of a term is
  more relevant than a document with 1
  occurrence.
• But NOT 85 times more relevant!
• Relevance doesn’t increase linearly with frequency!
• Raw term frequency will not help!
Log-tf Weighting
• Log term frequency weight of term t in
  document d is


w_t,d =   1 + log10(tf_t,d)   , if tf_t,d > 0
          0                   , otherwise
tf Score



Score(q,d) = Σ_(t ∈ q∩d) (1 + log10 tf_t,d)
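The weight and the score translate into a few lines of Python; the term counts below are taken from the term-frequency table for "Life, Universe & Everything":

```python
import math

def log_tf_weight(tf):
    """w_t,d = 1 + log10(tf) when tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_score(query_terms, doc_tf):
    """Sum of log-tf weights over terms shared by query and document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

# tf counts for one document ("Life, Universe & Everything"):
doc = {"galaxy": 51, "zaphod": 88, "ship": 85, "santorini": 3}
print(round(log_tf_weight(51), 2))                 # 2.71, matching the table
print(round(tf_score(["galaxy", "ship"], doc), 2)) # 5.64
```

A query term absent from the document contributes 0, exactly as required above.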
Term Frequency
Term        Hitchhiker’s   Last Chance   Life, Universe   Restaurant at    So Long & Thanks   Starship
            Guide to       to See        & Everything     End of           for all the Fish   Titanic
            Galaxy                                        Universe

galaxy      62             0             51               49               24                 29
zaphod      214            0             88               405              2                  0
ship        59             2             85               126              27                 119
arthur      347            0             376              236              313                0
fiordland   0              9             0                0                0                  0
santorini   0              0             3                0                0                  0
wordlings   0              0             0                0                1                  0
Term Frequency Weight
Term        Hitchhiker’s   Last Chance   Life, Universe   Restaurant at    So Long & Thanks   Starship
            Guide to       to See        & Everything     End of           for all the Fish   Titanic
            Galaxy                                        Universe

galaxy      2.79           0             2.71             2.69             2.38               2.46
zaphod      3.33           0             2.94             3.61             1.30               0
ship        2.77           1.30          2.93             3.10             2.43               3.08
arthur      3.54           0             3.58             3.37             3.50               0
fiordland   0              1.95          0                0                0                  0
santorini   0              0             1.47             0                0                  0
wordlings   0              0             0                0                1.00               0
Term Frequency

• Problem: all terms are considered equally
  important.
• Certain terms are of no use when determining
  relevance (e.g. stop words such as “the” or “a”).
Term Frequency
• Rare terms are more informative than frequent
  terms.
   – e.g. in the query “information retrieval”, the rarer
     term carries more signal.
• Frequent terms are less informative than rare
  terms (e.g. man, house, cat).
• A document containing a frequent query term is
  likely to be relevant
   – but relevance is not guaranteed.
• Frequent terms should still get positive weights,
  but lower than rare terms.
Agenda
•   Introduction
•   Basic Text Processing
•   IR Basics
•   Term Frequency Weighting
•   Inverse Document Frequency Weighting
•   TF – IDF Score
•   Activity
Document Frequency
• We use document frequency (df).
• dft is the number of documents in the collection
  that contain the term t.

• The lower the dft, the rarer the term, and the
  higher its importance.
Inverse Document Frequency
• We take inverse document frequency (idf) of t


       idft = log10(N/dft)
• N is the number of documents in the collection
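A direct translation into Python, with each document represented as its set of terms (the toy collection below is invented to mirror the df values used on the next slides):

```python
import math

def idf(term, docs):
    """idf_t = log10(N / df_t), where df_t = number of docs containing t.
    Returns None for a term that occurs in no document (idf undefined)."""
    N = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log10(N / df) if df else None

# Each document is its set of terms; N = 6, as in the book example:
docs = [{"galaxy", "ship"}, {"ship"}, {"galaxy", "ship"},
        {"ship"}, {"galaxy", "ship", "santorini"}, {"ship"}]
print(round(idf("ship", docs), 3))       # 0.0   (occurs in all 6 documents)
print(round(idf("santorini", docs), 3))  # 0.778 (occurs in only 1)
```

Note that a term occurring in every document gets idf 0: it cannot discriminate between documents at all.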
idf example
df             idf
1              7
10             6
100            5
1000           4
10000          3
100000         2
1000000        1
10000000       0


idft = log10(N/dft), suppose N = 10^7

There is only one idf value for each term in the collection.
idf weights
Term        df    idf

galaxy      5     0.079
zaphod      4     0.176
ship        6     0
arthur      4     0.176
fiordland   1     0.778
santorini   1     0.778
wordlings   1     0.778


    idft = log10(N/dft), here N = 6
Agenda
•   Introduction
•   Basic Text Processing
•   IR Basics
•   Term Frequency Weighting
•   Inverse Document Frequency Weighting
•   TF – IDF Score
•   Activity
tf-idf Weighting
• The tf-idf weight is simply the product of the tf
  and idf weights.


  Wt,d = (1 + log10 tft,d) x log10(N/dft)

• Increases with number of occurrences within
  document.
• Increases with rarity of term in collection.
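As a sketch of this formula (the example values are taken from the earlier tables):

```python
import math

def tfidf_weight(tf, df, N):
    """W_t,d = (1 + log10 tf) * log10(N / df); 0 when the term is absent."""
    if tf <= 0 or df <= 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# zaphod in Hitchhiker's Guide: tf = 214, df = 4, N = 6:
print(round(tfidf_weight(214, 4, 6), 2))  # 0.59
# ship gets weight 0 everywhere, because df = N makes its idf 0:
print(tfidf_weight(59, 6, 6))             # 0.0
```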
tf-idf Score
• Final ranking of a document d for a query q
  depends on




    Score(q,d) = Σ_(t ∈ q∩d) W_t,d
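Putting tf and idf together over the toy collection from the tables (df values and term counts for two of the books):

```python
import math

def tfidf_score(query, doc_tf, df, N):
    """Score(q,d): sum of tf-idf weights over query terms in the document."""
    score = 0.0
    for t in query:
        tf = doc_tf.get(t, 0)
        if tf > 0 and df.get(t, 0) > 0:
            score += (1 + math.log10(tf)) * math.log10(N / df[t])
    return score

# Toy collection from the tables above (N = 6 books):
df = {"galaxy": 5, "zaphod": 4, "ship": 6, "arthur": 4}
hitchhikers = {"galaxy": 62, "zaphod": 214, "ship": 59, "arthur": 347}
last_chance = {"ship": 2}

# "ship" contributes nothing (idf = 0), so this score comes from "arthur" alone:
print(round(tfidf_score(["arthur", "ship"], hitchhikers, df, 6), 2))  # 0.62
print(tfidf_score(["arthur", "ship"], last_chance, df, 6))            # 0.0
```

Ranking then just sorts the documents by this score, highest first.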
Term Frequency Weight
Term        Hitchhiker’s   Last Chance   Life, Universe   Restaurant at    So Long & Thanks   Starship
            Guide to       to See        & Everything     End of           for all the Fish   Titanic
            Galaxy                                        Universe

galaxy      2.79           0             2.71             2.69             2.38               2.46
zaphod      3.33           0             2.94             3.61             1.30               0
ship        2.77           1.30          2.93             3.10             2.43               3.08
arthur      3.54           0             3.58             3.37             3.50               0
fiordland   0              1.95          0                0                0                  0
santorini   0              0             1.47             0                0                  0
wordlings   0              0             0                0                1.00               0
idf weights
Term        df    idf

galaxy      5     0.079
zaphod      4     0.176
ship        6     0
arthur      4     0.176
fiordland   1     0.778
santorini   1     0.778
wordlings   1     0.778


    idft = log10(N/dft), here N = 6
tf-idf Weight
Term        Hitchhiker’s   Last Chance   Life, Universe   Restaurant at    So Long & Thanks   Starship
            Guide to       to See        & Everything     End of           for all the Fish   Titanic
            Galaxy                                        Universe

galaxy      0.2204         0             0.2140           0.2125           0.1880             0.1943
zaphod      0.5861         0             0.5174           0.6354           0.2288             0
ship        0              0             0                0                0                  0
arthur      0.6230         0             0.6301           0.5931           0.6160             0
fiordland   0              1.5171        0                0                0                  0
santorini   0              0             1.1437           0                0                  0
wordlings   0              0             0                0                0.7780             0
Agenda
•   Introduction
•   Basic Text Processing
•   IR Basics
•   Term Frequency Weighting
•   Inverse Document Frequency Weighting
•   TF – IDF Score
•   Activity: Spelling Correction
Using a Dictionary
• How do we tell whether a word’s spelling is correct?

• Maybe use a dictionary, like the Oxford dictionary?
• Then what about terms like “Accenture”,
  “Brangelina”, or “Paani da Rang”?
• No standard dictionary contains such words.
• They would be flagged as spelling errors, but
  they shouldn’t be!
Using a Corpus
• So, we use a collection of documents (a corpus).
• This collection serves as the basis for spelling
  correction.

• To correct a spelling:
   – find the word in the collection nearest to the
     misspelled word,
   – replace the misspelled word with the word we just
     found.
Minimum Edit Distance
• Minimum number of edit operations
   – Insertion (add a letter)
   – Deletion (remove one letter)
   – Substitution (change one letter to another)
   – Transposition (swap adjacent letters)

 needed to transform one word into the other.
Minimum Edit Distance (Example)

*   B   I   O   G   R   A   P   H   Y

A   U   T   O   G   R   A   P   H   *
i   s   s                           d

(i = insertion, s = substitution, d = deletion; * marks a gap)

• Let the cost of each operation be 1
   – Total edit distance between these words = 4
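The minimum over all possible alignments is what dynamic programming computes. Below is a standard Damerau-Levenshtein sketch (a textbook algorithm, not code from the slides) covering all four operations at unit cost:

```python
def edit_distance(a, b):
    """Minimum edits (insert, delete, substitute, transpose adjacent)
    to turn a into b, each operation costing 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete everything
    for j in range(n + 1):
        d[0][j] = j          # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("biography", "autograph"))  # 4, as in the alignment above
```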
Spelling Correction
• For a given word, find all words at edit distance 1
  or 2.
• Which of these words is the most likely spelling
  correction for the given word?

• The one that occurs most often in the
  corpus. That’s the answer!
Minimum Edit Distance
• Finding all strings at edit distance 1 results in a
  huge collection.
• For a word of length n:
   – Insertions: 26(n + 1)
   – Deletions: n
   – Substitutions: 26n
   – Transpositions: n - 1
   – TOTAL: 54n + 25
• A few of these may be duplicates.
• The number of strings at edit distance 2 is obviously
  much larger than 54n + 25.
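The candidate generation itself is a few list comprehensions (a Norvig-style sketch; the counts in the comment follow the arithmetic above):

```python
import string

def edits1(word):
    """All strings one edit away: 26(n+1) inserts + n deletes
    + 26n substitutions + (n-1) transpositions = 54n + 25 strings,
    collected into a set so duplicates collapse."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

n = len("ship")
print(54 * n + 25)                        # 241 candidate strings generated
print(len(edits1("ship")) < 54 * n + 25)  # True: duplicates collapsed in the set
```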
Complete Algorithm

1. Calculate the frequency of each word in the corpus.
2. Input a word to be corrected.
3. If the input word is present in the corpus, return it as-is.
4. Else, find all words at edit distance 1 or 2.
5. Among these words, return the word that occurs most
   often in the corpus.
6. If none of these words occurs in the corpus, return the
   original word.
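A compact sketch of steps 1-6, run here on a made-up toy corpus. It pools the distance-1 and distance-2 candidates together, exactly as the steps say; preferring distance-1 matches before falling back to distance 2 is one modification worth trying when pushing the accuracy up:

```python
from collections import Counter
import string

def edits1(word):
    """All strings at edit distance 1 (deletes, transposes, replaces, inserts)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set([L + R[1:] for L, R in splits if R] +
               [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1] +
               [L + c + R[1:] for L, R in splits if R for c in letters] +
               [L + c + R for L, R in splits for c in letters])

def correct(word, counts):
    """Steps 3-6: return the word if known; else the most frequent known
    word within edit distance 1 or 2; else the word unchanged."""
    if word in counts:
        return word
    near = edits1(word)
    near |= {w2 for w in near for w2 in edits1(w)}  # extend to distance 2
    known = [w for w in near if w in counts]
    return max(known, key=counts.get) if known else word

# Step 1 on a toy corpus; the activity uses the corpus described below:
counts = Counter("the quick brown fox jumps over the lazy dog the".split())
print(correct("teh", counts))  # 'the'
```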
Evaluation



Accuracy = (number of words successfully corrected) / (number of input words)
You’re Given
• A collection of documents (corpus), put together by
  Peter Norvig from:
   – public domain books from Project Gutenberg
   – the list of most frequent words from Wiktionary
   – the British National Corpus
• Starter code in Java, Python and C#
   – Contains code for reading the corpus and calculating accuracy.
• A test set from Roger Mitton’s Birkbeck Spelling Error
  Corpus (slightly modified)
   – To test your algorithm.
TODO

• Implement the algorithm in Java, Python or C#.
   – A correct implementation will reach an accuracy of ~31.50%.
• Modify the given algorithm to increase the accuracy to
  50%.
• Going beyond 50% on the given test set is quite
  challenging.
Further Reading
Information Retrieval:
1.  http://nlp.stanford.edu/fsnlp/
2.  http://nlp.stanford.edu/IR-book/
3.  https://class.coursera.org/nlp/class/index
4.  http://en.wikipedia.org/wiki/Tf*idf
5.  http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt
6.  http://research.google.com/pubs/InformationRetrievalandtheWeb.html
7.  http://norvig.com/
8.  http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-
    the-vector-space-model-1.html
9. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-864-
    advanced-natural-language-processing-fall-2005/index.htm
10. http://www.slideshare.net/butest/search-engines-3859807
Further Reading
Spelling Correction:
1.   http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf
2.   http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.1392
3.   http://www.stanford.edu/class/cs276/handouts/spelling.pptx
4.   http://alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html
5.   http://portal.acm.org/citation.cfm?id=146380
6.   http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.9400
7.   http://static.googleusercontent.com/external_content/untrusted_dlcp/research.
     google.com/en/us/pubs/archive/36180.pdf
8.   http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9D
     A285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf

More Related Content

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 

Featured

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Information retrieval: Creating a Search Engine

  • 1. Information Retrieval: Creating a Search Engine
  • 2. Agenda • Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF – IDF Score • Activity
  • 3. Agenda • Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF – IDF Score • Activity
  • 4. Introduction Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections. - C Manning, P Raghavan, Hinrich
  • 5. Agenda • Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF – IDF Score • Activity
  • 6. Basic Text Processing Word Tokenization: (Dividing a sentence into words.) • Hey, where have you been last week? • I flew from New York to Illinois and finally landed in Jammu & Kashmir. • Long Journey, huh?
  • 7. Issues in Tokenization • New York  One Token or Two? • Jammu & Kashmir  One Token or Three? • Huh, Hmmm, Uh  ?? • India’s Economy  India? Indias? India’s? • Won’t, isn’t  Will not? Is not? Isn’t? • Mother-in-law  Mother in Law? • Ph.D.  PhD? Ph D ? Ph.D.?
  • 8. Language Issues in Tokenization German Noun Compounds are not segmented • Lebensversicherungsgesellschaftsangesteller • ‘life insurance company employee’ Example from Foundations of Natural Language Processing; C. Manning, Henrich S
  • 9. Language Issues in Tokenization • There is no space between words in Chinese and Japanese. • In Japanese, multiple alphabets are intermingled. • Arabic (or Hebrew) is basically written right to left, but certain items like numbers are written left to right.
  • 10. Regular Expressions • Regular Expressions are a way to represent patterns in text.
  • 11. Regular Expressions • Regular Expressions are a way to represent patterns in text.
  • 12. Agenda • Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF – IDF Score • Activity
  • 13. Basic Search Model User Needs Info Query Query Refining Results
  • 14. IR System Evaluation • Precision: How many documents retrieved are relevant to user’s information need? • Recall: How many relevant documents in collection are retrieved? • F Score: Harmonic Mean of precision and recall
  • 15. Ranked IR • How to rank the retrieved documents? • We assign a score to each document. • This score should measure how good is the “query – document” pair.
  • 16. Agenda • Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF – IDF Score • Activity
  • 17. • We need a way to assign score to each “query – document” pair. • More frequent the query term in the document, the higher should be the score. • If a query term doesn’t occur in document: score should be 0.
  • 18. Term Frequency Hitchhiker’s Last Life, Restaurant So Long & Starship Guide to Chance to Universe & at End of Thanks for Titanic Galaxy See Everything Universe all the Fish galaxy 62 0 51 49 24 29 zaphod 214 0 88 405 2 0 ship 59 2 85 126 27 119 arthur 347 0 376 236 313 0 fiordland 0 9 0 0 0 0 santorini 0 0 3 0 0 0 wordlings 0 0 0 0 1 0
  • 19. Term Frequency • How to use tf for computing the query-document match score? • A document with 85 occurrences of a term is more relevant than a document with 1 occurrences. • But NOT 85 times more relevant! • Relevance don’t increase linearly with frequency! • Raw term frequency will not help!
  • 20. Log-tf Weighting • Log term frequency weight of term t in document d is 1 + log10tf , if tf > 0 wt,d = 0 , if tf <= 0
  • 21. tf Score S = ∑t in q∩d 1 + log10tf q,d
  • 22. Term Frequency Hitchhiker’s Last Life, Restaurant So Long & Starship Guide to Chance to Universe & at End of Thanks for Titanic Galaxy See Everything Universe all the Fish galaxy 62 0 51 49 24 29 zaphod 214 0 88 405 2 0 ship 59 2 85 126 27 119 arthur 347 0 376 236 313 0 fiordland 0 9 0 0 0 0 santorini 0 0 3 0 0 0 wordlings 0 0 0 0 1 0
  • 23. Term Frequency Weight Hitchhiker’s Last Life, Restaurant So Long & Starship Guide to Chance to Universe & at End of Thanks for Titanic Galaxy See Everything Universe all the Fish galaxy 2.79 0 2.71 2.69 2.38 2.46 zaphod 3.33 0 2.94 3.61 1.30 0 ship 2.77 1.30 1.93 3.10 2.43 3.08 arthur 3.54 0 3.58 3.37 3.50 0 fiordland 0 1.95 0 0 0 0 santorini 0 0 1.47 0 0 0 wordlings 0 0 0 0 1.00 0
  • 24. Term Frequency • Problem: all terms are considered equally important. • Certain terms are of no use when determining relevance.
  • 25. Term Frequency • Rare terms are more informative than frequent terms in the document.  information retrieval • Frequent terms are less informative than rare terms (eg man, house, cat) • Document containing frequent term is likely to be relevant  But relevance is not guaranteed. • Frequent terms should get +ve weights, but lower than rare terms.
  • 26. Agenda • Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF – IDF Score • Activity
  • 27. Document Frequency • We use document frequency (df) • dft is the number of documents in the collection that contains the term t. • Lower the dft, rarer is the term, and higher the importance.
  • 28. Inverse Document Frequency • We take inverse document frequency (idf) of t idft = log10(N/dft) • N is the number of documents in the collection
  • 29. idf example df idf 1 7 10 6 100 5 1000 4 10000 3 100000 2 1000000 1 10000000 0 idft = log10(N/dft), suppose N = 107 There is only one value for each term in the collection.
  • 30. idf weights Term df idf galaxy 5 0.079 zaphod 4 0.176 ship 6 0 arthur 4 0.176 fiordland 1 0.778 santorini 1 0.778 wordlings 1 0.778 idft = log10(N/dft), here N = 6
  • 31. Agenda • Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF – IDF Score • Activity
  • 32. tf-idf Weighting • tf-idf weight is simple the product of tf and idf weight. Wt,d = (1 + log10 tft,d) x log10(N/dft) • Increases with number of occurrences within document. • Increases with rarity of term in collection.
  • 33. tf-idf Score • Final ranking of a document d for a query q depends on Score(q,d) = ∑ Wt,d
  • 34. Term Frequency Weight Hitchhiker’s Last Life, Restaurant So Long & Starship Guide to Chance to Universe & at End of Thanks for Titanic Galaxy See Everything Universe all the Fish galaxy 2.79 0 2.71 2.69 2.38 2.46 zaphod 3.33 0 2.94 3.61 1.30 0 ship 2.77 1.30 1.93 3.10 2.43 3.08 arthur 3.54 0 3.58 3.37 3.50 0 fiordland 0 1.95 0 0 0 0 santorini 0 0 1.47 0 0 0 wordlings 0 0 0 0 1.00 0
  • 35. idf weights Term df idf galaxy 5 0.079 zaphod 4 0.176 ship 6 0 arthur 4 0.176 fiordland 1 0.778 santorini 1 0.778 wordlings 1 0.778 idft = log10(N/dft), here N = 6
  • 36. tf-idf Weight Hitchhiker’s Last Life, Restaurant So Long & Starship Guide to Chance to Universe & at End of Thanks for Titanic Galaxy See Everything Universe all the Fish galaxy 0.2204 0 0.2140 0.2125 0.1880 0.1943 zaphod 0.5861 0 0.5174 0.6354 0.2288 0 ship 0 0 0 0 0 0 arthur 0.6230 0 0.6301 0.5931 0.6160 0 fiordland 0 1.5171 0 0 0 0 santorini 0 0 1.1437 0 0 0 wordlings 0 0 0 0 0.7780 0
  • 37. Agenda • Introduction • Basic Text Processing • IR Basics • Term Frequency Weighting • Inverse Document Frequency Weighting • TF – IDF Score • Activity: Spelling Correction
• 38. Using a Dictionary • How do we tell whether a word's spelling is correct? • Maybe use a dictionary, like the Oxford Dictionary? • Then what about terms like “Accenture” or “Brangelina” or “Paani da Rang”? • No dictionary contains such words, so they would be flagged as spelling errors. But this should not be the case!
• 39. Using a Corpus • So, we use a collection of documents. • This collection is used as the basis for spelling correction. • To correct a spelling:  find the word in the collection which is nearest to the misspelled word, and  replace the misspelled word with it.
  • 40. Minimum Edit Distance • Minimum number of edit operations  Insertion (add a letter)  Deletion (remove one letter)  Substitution (change one letter to another)  Transposition (swap adjacent letters) needed to transform one word into the other.
• 41. Minimum Edit Distance (Example)
*  B  I  O  G  R  A  P  H  Y
A  U  T  O  G  R  A  P  H  *
i  s  s                    d
(i = insertion, s = substitution, d = deletion)
• Let the cost of each operation be 1  Total edit distance between these words = 4
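The example above can be checked with a standard dynamic-programming sketch of this distance (Damerau–Levenshtein: insertion, deletion, substitution, and adjacent transposition, each at cost 1). This is a generic textbook implementation, not the deck's starter code:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, substitutions, and
    adjacent transpositions needed to turn string a into string b."""
    m, n = len(a), len(b)
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```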
• 42. Spelling Correction • For a given word, find all words at edit distance 1 and 2. • Which of these words is the most likely correction for the given word? • The one that occurs most often in the corpus. That’s the answer!
• 43. Minimum Edit Distance • Finding all words at edit distance 1 already yields a huge collection. • For a word of length n,  Insertions: 26(n + 1)  Deletions: n  Substitutions: 26n  Transpositions: n – 1  TOTAL: 54n + 25 • A few of these may be duplicates. • The number of words at edit distance 2 is obviously far more than 54n + 25.
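These counts can be generated directly. The sketch below (essentially Peter Norvig's well-known candidate generator, offered here as an illustration rather than the deck's starter code) builds all 54n + 25 distance-1 candidates and deduplicates them with a set:

```python
import string

def edits1(word):
    """Every string at edit distance 1 from word: 26(n+1) insertions,
    n deletions, 26n substitutions, and n-1 transpositions — 54n + 25
    candidates in total before duplicates are removed."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```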
• 44. Complete Algorithm 1. Calculate the frequency of each word in the corpus. 2. Input a word to be corrected. 3. If the input word is present in the corpus, return that word. 4. Else, find all words at an edit distance of 1 and 2. 5. Among these words, return the one that occurs most often in the corpus. 6. If none of these words occur in the corpus, return the original word.
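The six steps can be sketched end-to-end, close in spirit to Norvig's corrector. This is a hedged, self-contained sketch, not the deck's actual starter code; the tiny training text in the test is made up:

```python
import re
import string
from collections import Counter

def train(text):
    """Step 1: word -> frequency over the corpus."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def edits1(word):
    """All strings at edit distance 1 (deletes, transposes, replaces, inserts)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set([L + R[1:] for L, R in splits if R] +
               [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1] +
               [L + c + R[1:] for L, R in splits if R for c in letters] +
               [L + c + R for L, R in splits for c in letters])

def correct(word, counts):
    """Steps 2-6: the most frequent in-corpus word within edit distance 2."""
    if word in counts:                                    # step 3
        return word
    e1 = edits1(word)
    e2 = {w for e in e1 for w in edits1(e)}               # step 4: distance 2
    candidates = [w for w in (e1 | e2) if w in counts]
    if not candidates:                                    # step 6
        return word
    return max(candidates, key=counts.get)                # step 5
```

Note that this pools distance-1 and distance-2 candidates together, exactly as steps 4-5 state; preferring distance-1 candidates before falling back to distance 2 is one of the modifications that can raise the accuracy.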
• 45. Evaluation Accuracy = (number of words successfully corrected) / (number of input words)
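A sketch of this measure, assuming the corrector's outputs and the intended words arrive as parallel lists (the function name and interface are illustrative, not the starter code's API):

```python
def accuracy(corrected, intended):
    """Fraction of input words whose correction matches the intended word."""
    assert len(corrected) == len(intended)
    hits = sum(c == g for c, g in zip(corrected, intended))
    return hits / len(intended)
```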
• 46. You’re Given • A collection of documents (corpus)  Public domain books from Project Gutenberg  List of most frequent words from Wiktionary  British National Corpus, put together by Peter Norvig • Starter code in Java, Python and C#  Contains code for reading the corpus and calculating accuracy • A test set from Roger Mitton’s Birkbeck Spelling Error Corpus (slightly modified)  To test your algorithm
• 47. TODO • Implement the algorithm in Java, Python or C#  A successful implementation will give an accuracy of ~31.50% • Modify the given algorithm to increase the accuracy to 50% • Going beyond 50% on the given test set is quite challenging.
• 48. Further Reading Information Retrieval: 1. http://nlp.stanford.edu/fsnlp/ 2. http://nlp.stanford.edu/IR-book/ 3. https://class.coursera.org/nlp/class/index 4. http://en.wikipedia.org/wiki/Tf*idf 5. http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt 6. http://research.google.com/pubs/InformationRetrievalandtheWeb.html 7. http://norvig.com/ 8. http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html 9. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-864-advanced-natural-language-processing-fall-2005/index.htm 10. http://www.slideshare.net/butest/search-engines-3859807
• 49. Further Reading Spelling Correction: 1. http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf 2. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.1392 3. http://www.stanford.edu/class/cs276/handouts/spelling.pptx 4. http://alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html 5. http://portal.acm.org/citation.cfm?id=146380 6. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.9400 7. http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36180.pdf 8. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf

Editor's Notes

1. What is Information Retrieval? Suppose you want to know the current score of the Euro Cup final. You either Google it or go to some sports website. Suppose you want to know when Brad Pitt and Angelina Jolie's wedding is. You go to Google and search for it. You want to know the meaning of some word, say "anarchist". You pick up the Oxford (or any other) dictionary from your shelf and get the meaning. Basically, you want some information, so you go to the internet, a book shelf, or somewhere else to retrieve that information. That is Information Retrieval (IR). Systems like Google, Bing, etc. that help you retrieve information are called IR systems. That's what we will be focusing on. Search engines search data from the internet. The data on the internet is mostly unstructured, and the internet is huge. That's where this definition comes from. A few examples of IR systems: whenever you get an email saying "party is at Tree Garden on 12 Jan 8:00 PM", the email system automatically identifies it and offers to create a calendar event; spam email detection; automated essay evaluation; showing targeted ads.
2. Of course, search engines today are capable of doing a lot of things. You can search for images, multimedia, research papers, books, etc. But in this presentation, we will concentrate only on text search. A lot of text is available on the internet, and we need to process that text so that information can be retrieved from it. The first thing we need to do is extract words from a sentence, also called tokenization.
3. As far as English is concerned, any two words are always separated by spaces. We use this fact to tokenize a sentence into words. Let us try our hand at tokenizing a few sentences. "Hey, where have you been last week?" This one is easy: the sentence has 7 words. "I flew from New York to Illinois and finally landed in Jammu & Kashmir." Try this. The first 3 words are obviously "I", "flew", and "from". What about the 4th word? Is it "New" or "New York"? What about "Jammu & Kashmir"? Should this be considered as 1 word or 3? "Long Journey, huh?" Should words like "huh", "hmmm", etc. be considered?
4. Here are a few more examples of what we discussed in the last slide. Should words like New York and San Francisco be considered as one? What about words like huh, hmm, uh, etc.? How do we deal with apostrophes? Should "isn't" be considered "isn't" or "isnt" or "is not"? What about hyphens? What about the dots in abbreviations?
5. Until the last slide we were only considering English. What about other languages? In German, compound nouns are not segmented; there is no space between the words. How do we tokenize them?
6. Apart from tokenization, this is another method for text processing. Certain texts have specific patterns, like email ids: they always start with a word, followed by @, followed by a word, followed by some domain name like .com or .org or .co.in. The dates we get in invitation emails follow a particular pattern. Likewise, phone numbers and zip codes have particular patterns. Regular Expressions, also called regexes, are a way to represent such text patterns. We use these regexes to extract (retrieve) information. Suppose we want to find all the email ids on a particular page (so that we can spam them ;), kidding). We use the email id regex.
7. There are many other methods for text processing, but covering all of them is beyond the scope of this presentation. Let us move on to IR basics. Although there are many IR systems, we will concentrate only on search engines. In IR basics, we will get into the basic search model and how to evaluate the performance of an IR system. We will get introduced to ranked IR.
8. The user needs some information. For example, "What is the UN doing to normalize the situation in Syria?" He turns it into a query, say "UN actions on Syria". He then goes to a search engine and searches for it. The search engine looks into the various websites it has indexed and returns the search results. If the user gets the required information, we are done. But sometimes the results are not satisfactory, so the user refines the query, say to "UN actions in Syria violence", and the cycle repeats.
9. How do we evaluate the accuracy of an IR system? We consider precision and recall. The harmonic mean of the two is the generally accepted measure for evaluating IR systems. But for search engines, ranking is equally important. There is no standard method to evaluate the ranking; it generally depends on the user.
10. Ranked Information Retrieval. The information that we retrieve needs to be ranked. This ranking is generally done on the basis of relevance: the more relevant the document, the higher its rank should be. Now the question is, how do we rank the retrieved docs? In school, ranking is done on the basis of scores, and that's what we do here as well. We will give a score to each document. This score will reflect how relevant a particular document is with respect to a query.
11. One thing is pretty obvious: the documents we retrieve need to contain the query terms. If a document doesn't contain the query terms at all, it is not relevant, which means its score should be 0. And the more query terms a document contains, the higher its score should be.
12. Here, I have put up a small matrix. I have considered 6 books written by Douglas Adams. The matrix shows the frequency of some of the words in these books. So, the word zaphod occurs 234 times in Hitchhiker's Guide to Galaxy, 88 times in Life, Universe and Everything, 405 times in Restaurant at the End of Universe, and 2 times in So Long and Thanks for all the Fish. These frequencies will help us decide the ranking of the documents.
13. But raw frequencies of words may mislead us. If a term occurs 85 times in one document and once in another, the first one is of course more relevant than the second, but not 85 times more relevant. Relevance and frequency are not linearly related, so we need a way to scale these frequencies.
14. And logarithms can help us scale the frequencies. So, instead of taking the raw frequencies, we take their weight. The weight is 0 if the term doesn't occur in the document; if it occurs, we take the log of the raw frequency. We add 1 to the log because if the term frequency is 1, the log would be 0, which is not what we want.
15. And finally, to calculate the term frequency score, we sum, for each document, the weights of all the query terms it contains.
  16. Here’s the same term frequency matrix that we saw a few slides back.
17. And here's the log of those term frequencies. We can clearly see how the weights are scaled: if the frequency is 214, the weight is 2.22, and if the frequency is 2, the weight is 1.20. The difference has clearly dampened.
  18. But while using term frequencies, we consider all the terms as equally important. But certain terms are of no use when determining relevance.
19. In fact, important terms are sometimes rarer than unimportant terms. For example, "information retrieval": how many times do you think a document about IR will contain this exact term? It is definitely going to be rare compared to other terms. So, frequent terms are less informative than rare terms. A document containing a frequent term might be relevant, but there is no certainty about it. So, we should give positive weights to frequent terms, but rarer terms should get higher weights.
20. For this purpose, we use the document frequency. df is the number of documents in the collection that contain the term t. If the term earthling is found in 1 document, its document frequency is 1; if it is found in 3, its df is 3. But document frequency is an inverse measure of a term's importance: the higher the df, the less informative the term.
21. We use the logarithm to dampen the weights, for the same reason we used it in the term frequency weight.
22. Suppose we have 10^7 documents in the collection. Words like the, there, is, of, etc. occur in almost every document, so the document frequency for such words will be 10^7 and their idf weight will be 0. If a term occurs rarely in the collection, its weight goes up. Since we have assumed that our collection is static, each term has a single idf weight.
23. Here are the idf weights of the terms we were discussing for tf. Remember that here N = 6; we are taking only 6 documents into consideration.
24. IMPORTANT: There is a hyphen (-) between tf and idf. It is not a minus sign. This is the best known weighting scheme in IR.
25. This is the final score of the document. But the journey doesn't end here: the whole thing is represented as a Vector Space Model (VSM) and cosine similarity is taken. These topics are beyond the scope of this presentation.
  26. tf weights – What we discussed.
  27. idf weights – The slide that we discussed.
28. The tf-idf weights of the terms we are discussing. Now suppose the query is "zaphod galaxy". Add the tf-idf weights of both terms for each document; this gives the score of the document. For example, the score of Hitchhiker's Guide to Galaxy for this query is 0.2204 + 0.5861 = 0.8065, and the score of Starship Titanic is 0.1943 + 0 = 0.1943. The document with the highest score is the most relevant, the one with the second highest score is the second most relevant, and so on.
29. How do we tell whether a spelling is correct? The most obvious answer is to check a dictionary. But then there are terms that are proper nouns or song titles; the dictionary won't have these words.
30. We use a collection of documents available on the web. A collection of documents is called a corpus in IR. So, we use a corpus for spelling correction. Spelling mistakes are mostly silly mistakes a user makes while typing, so we can say that what the user typed is not completely different from what he intended to type. It is almost the correct spelling, with a minor mistake somewhere. So, to correct the spelling, we find the word in the corpus which is nearest to the misspelled word. Focus on "nearest": this word gives a sense of distance between words. How do we define that?
31. The distance between two words is the minimum edit distance: the number of edits we have to make to convert one word into the other.
32. Let us look at an example with the two words biography and autograph. We will convert 'biography' into 'autograph'. The 'ograph' part of both words is the same, so we keep it unchanged. First, we insert an 'A'. Next, we substitute 'B' with 'U'. Then, we substitute 'I' with 'T'. Finally, we delete the extra 'Y'. So, 4 edits were required to convert 'biography' into 'autograph': 1 insertion, 2 substitutions, and 1 deletion. We say the edit distance between these two words is 4.
33. As we discussed, spelling errors are mostly minor. For spelling correction, we do not need to go further than an edit distance of 2. So, we find all the words which are at an edit distance of 1 or 2 from the misspelled word. Our answer is the one that occurs most often in the corpus.