SlideShare a Scribd company logo
1 of 16
INFORMATION RETRIEVAL
Information Retrieval
• Information retrieval is the task of finding documents that are
relevant to a user’s need for information.
• The best-known examples of information retrieval systems are search
engines on the World Wide Web.
• A Web user can type a query such as [AI book] into a search engine
and see a list of relevant pages.
• In this section, we will see how such systems are built.
An IR system can be characterized by
1. A corpus of documents. Each system must decide what it wants to treat
as a document: a paragraph, a page, or a multipage text.
2. Queries posed in a query language. A query specifies what the user
wants to know.
3. A result set. This is the subset of documents that the IR system judges to
be relevant to the query.
4. A presentation of the result set. This can be as simple as a ranked list of
document titles.
The earliest IR systems worked on a Boolean keyword model.
Cont.,
• This model has the advantage of being
• simple to explain and implement.
• Disadvantage:
• First, the degree of relevance of a document is a single bit, so there is
no guidance as to how to order the relevant documents for
presentation.
• Second, Boolean expressions are unfamiliar to users who are not
programmers or logicians.
• Third, it can be hard to formulate an appropriate query, even for a
skilled user.
IR scoring functions
• Most IR systems have abandoned the Boolean model and use models
based on the statistics of word counts.
• We describe the BM25 scoring function, which comes from the Okapi
project of Stephen Robertson and Karen Sparck Jones at London’s City
College, and has been used in search engines such as the open-source
Lucene project.
• A scoring function takes a document and a query and returns a numeric
score; the most relevant documents have the highest scores.
• In the BM25 function, the score is a linear weighted combination of scores
for each of the words that make up the query.
Three factors affect the weight of a query term:
• First, the frequency with which a query term appears in a document:
• For the query [farming in Kansas], documents that mention “farming”
frequently will have higher scores.
• Second, the inverse document frequency of the term, or IDF.
• The word “in” appears in almost every document, so it has a high document
frequency, and thus a low inverse document frequency, and thus it is not as
important to the query as “farming” or “Kansas.”
• Third, the length of the document.
• A million-word document will probably mention all the query words, but may
not actually be about the query. A short document that mentions all the
words is a much better candidate.
Cont.,
The BM25 function takes all three of these into account. Then, given a document
dj and a query consisting of the words q1:N, we have
TF(qi, dj ), the count of the number of times word qi appears in document dj .
DF(qi), that gives the number of documents that contain the word qi.
We have two parameters, k and b, that can be tuned by cross-validation; typical
values are k = 2.0 and b = 0.75.
Cont.,
• L is the average document length
• |dj | is the length of document dj in words
• IDF(qi) is the inverse document frequency of word qi, given by
IR system evaluation
• How do we know whether an IR system is performing well?
• We undertake an experiment in which the system is given a set of
queries and the result sets are scored with respect to human
relevance judgments.
• Traditionally, there have been two measures used in the scoring:
• recall
• precision.
Cont.,
• For example, Imagine that an IR system has returned a result set for a
single query, for which we know which documents are and are not
relevant, out of a corpus of 100 documents. The document counts in
each category are given in the following table:
Cont.,
• Precision measures the proportion of documents in the
result set that are actually relevant.
• In our example, the precision is 30/(30 + 10)=.75.
• The false positive rate is 1 − .75=.25.
• Recall measures the proportion of all the relevant
documents in the collection that are in the result set.
• In our example, recall is 30/(30 + 20)=.60.
• The false negative rate is 1 − .60=.40.
IR refinements
• There are many possible refinements to the system described here,
and indeed Web search engines are continually updating their
algorithms as they discover new approaches and as the Web grows
and changes.
• One common refinement is a better model of the effect of
document length on relevance.
The PageRank algorithm
• PageRank was one of the two original ideas that set Google’s search apart from
other Web search engines when it was introduced in 1997.
• PageRank was invented to solve the problem of the tyranny of TF scores: if the
query is [IBM], how do we make sure that IBM’s home page, ibm.com, is the first
result, even if another page mentions the term “IBM” more frequently?
• The idea is that ibm.com has many in-links (links to the page), so it should be
ranked higher:
• the PageRank algorithm is designed to weight links from high-quality sites more
heavily. What is a highquality site? One that is linked to by other high-quality
sites.
The HITS algorithm
• The Hyperlink-Induced Topic Search algorithm, also known as “Hubs
and Authorities” or HITS, is another influential link-analysis algorithm.
• HITS differs from PageRank in several ways.
• First, it is a query-dependent measure: it rates pages with respect to a
query.
• Given a query, HITS first finds a set of pages that are relevant to the
query.
Cont.,
• It does that by intersecting hit lists of query words, and then adding
pages in the link neighborhood of these pages—pages that link to or
are linked from one of the pages in the original relevant set.
• Each page in this set is considered an authority on the query to the
degree that other HUB pages in the relevant set point to it.
• A page is considered a hub to the degree that it points to other
authoritative pages in the relevant set.
Question answering
• Information retrieval is the task of finding documents that are
relevant to a query, where the query may be a question, or just a
topic area or concept.
• Question answering is a somewhat different task, in which the query
really is a question, and the answer is not a ranked list of documents
but rather a short response—a sentence, or even just a phrase.
• There have been question-answering NLP (natural language
processing) systems since the 1960s, but only since 2001 have such
systems used Web information retrieval to radically increase their
breadth of coverage.

More Related Content

What's hot

Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)Amir Fahmideh
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval ssilambu111
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
Suffix Tree and Suffix Array
Suffix Tree and Suffix ArraySuffix Tree and Suffix Array
Suffix Tree and Suffix ArrayHarshit Agarwal
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Primya Tamil
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introductionnimmyjans4
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data MiningKamal Acharya
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayesDhwaj Raj
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsVaibhav Khanna
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data miningKamal Acharya
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisDataminingTools Inc
 

What's hot (20)

Web content mining
Web content miningWeb content mining
Web content mining
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Suffix Tree and Suffix Array
Suffix Tree and Suffix ArraySuffix Tree and Suffix Array
Suffix Tree and Suffix Array
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
Web data mining
Web data miningWeb data mining
Web data mining
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data Mining
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayes
 
Bayesian learning
Bayesian learningBayesian learning
Bayesian learning
 
lec6
lec6lec6
lec6
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
Text MIning
Text MIningText MIning
Text MIning
 
Web content mining
Web content miningWeb content mining
Web content mining
 

Similar to information retrieval

Algorithms and Data Structures
Algorithms and Data StructuresAlgorithms and Data Structures
Algorithms and Data Structuressonykhan3
 
Information retrieval systems irt ppt do
Information retrieval systems irt ppt doInformation retrieval systems irt ppt do
Information retrieval systems irt ppt doPonnuthuraiSelvaraj1
 
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalVikas Bhushan
 
Phrase based Indexing and Information Retrieval
Phrase based Indexing and Information RetrievalPhrase based Indexing and Information Retrieval
Phrase based Indexing and Information RetrievalBala Abirami
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologiesenterprisesearchmeetup
 
'A critique of testing' UK TMF forum January 2015
'A critique of testing' UK TMF forum January 2015 'A critique of testing' UK TMF forum January 2015
'A critique of testing' UK TMF forum January 2015 Georgina Tilby
 
Query Dependent Pseudo-Relevance Feedback based on Wikipedia
Query Dependent Pseudo-Relevance Feedback based on WikipediaQuery Dependent Pseudo-Relevance Feedback based on Wikipedia
Query Dependent Pseudo-Relevance Feedback based on WikipediaYI-JHEN LIN
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
Information retrieval 6 ir models
Information retrieval 6 ir modelsInformation retrieval 6 ir models
Information retrieval 6 ir modelsVaibhav Khanna
 
Chapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfChapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfHabtamu100
 
Interface for Finding Close Matches from Translation Memory
Interface for Finding Close Matches from Translation MemoryInterface for Finding Close Matches from Translation Memory
Interface for Finding Close Matches from Translation MemoryPriyatham Bollimpalli
 
An Advanced IR System of Relational Keyword Search Technique
An Advanced IR System of Relational Keyword Search TechniqueAn Advanced IR System of Relational Keyword Search Technique
An Advanced IR System of Relational Keyword Search Techniquepaperpublications3
 
Personalized Search and Job Recommendations - Simon Hughes, Dice.com
Personalized Search and Job Recommendations - Simon Hughes, Dice.comPersonalized Search and Job Recommendations - Simon Hughes, Dice.com
Personalized Search and Job Recommendations - Simon Hughes, Dice.comLucidworks
 
A Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis IA Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis IUNCResearchHub
 
Combining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationCombining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationSonia Haiduc
 

Similar to information retrieval (20)

Search quality in practice
Search quality in practiceSearch quality in practice
Search quality in practice
 
Chapter 7.pdf
Chapter 7.pdfChapter 7.pdf
Chapter 7.pdf
 
qury.pdf
qury.pdfqury.pdf
qury.pdf
 
Algorithms and Data Structures
Algorithms and Data StructuresAlgorithms and Data Structures
Algorithms and Data Structures
 
Information retrieval systems irt ppt do
Information retrieval systems irt ppt doInformation retrieval systems irt ppt do
Information retrieval systems irt ppt do
 
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
 
Phrase based Indexing and Information Retrieval
Phrase based Indexing and Information RetrievalPhrase based Indexing and Information Retrieval
Phrase based Indexing and Information Retrieval
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
 
'A critique of testing' UK TMF forum January 2015
'A critique of testing' UK TMF forum January 2015 'A critique of testing' UK TMF forum January 2015
'A critique of testing' UK TMF forum January 2015
 
Information Retrieval Evaluation
Information Retrieval EvaluationInformation Retrieval Evaluation
Information Retrieval Evaluation
 
Query Dependent Pseudo-Relevance Feedback based on Wikipedia
Query Dependent Pseudo-Relevance Feedback based on WikipediaQuery Dependent Pseudo-Relevance Feedback based on Wikipedia
Query Dependent Pseudo-Relevance Feedback based on Wikipedia
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
Information retrieval 6 ir models
Information retrieval 6 ir modelsInformation retrieval 6 ir models
Information retrieval 6 ir models
 
Chapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfChapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdf
 
Interface for Finding Close Matches from Translation Memory
Interface for Finding Close Matches from Translation MemoryInterface for Finding Close Matches from Translation Memory
Interface for Finding Close Matches from Translation Memory
 
An Advanced IR System of Relational Keyword Search Technique
An Advanced IR System of Relational Keyword Search TechniqueAn Advanced IR System of Relational Keyword Search Technique
An Advanced IR System of Relational Keyword Search Technique
 
Personalized Search and Job Recommendations - Simon Hughes, Dice.com
Personalized Search and Job Recommendations - Simon Hughes, Dice.comPersonalized Search and Job Recommendations - Simon Hughes, Dice.com
Personalized Search and Job Recommendations - Simon Hughes, Dice.com
 
A Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis IA Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis I
 
Combining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationCombining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept Location
 

More from ssbd6985

UNIT-3 Servlet
UNIT-3 ServletUNIT-3 Servlet
UNIT-3 Servletssbd6985
 
Best methods of staff selection and motivation
Best methods of staff selection and motivationBest methods of staff selection and motivation
Best methods of staff selection and motivationssbd6985
 
Information Extraction
Information ExtractionInformation Extraction
Information Extractionssbd6985
 
Information Extraction
Information ExtractionInformation Extraction
Information Extractionssbd6985
 
Information Extraction
Information ExtractionInformation Extraction
Information Extractionssbd6985
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalssbd6985
 
Expert System Full Details
Expert System Full DetailsExpert System Full Details
Expert System Full Detailsssbd6985
 

More from ssbd6985 (7)

UNIT-3 Servlet
UNIT-3 ServletUNIT-3 Servlet
UNIT-3 Servlet
 
Best methods of staff selection and motivation
Best methods of staff selection and motivationBest methods of staff selection and motivation
Best methods of staff selection and motivation
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Expert System Full Details
Expert System Full DetailsExpert System Full Details
Expert System Full Details
 

Recently uploaded

ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 

Recently uploaded (20)

ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 

information retrieval

  • 2. Information Retrieval • Information retrieval is the task of finding documents that are relevant to a user’s need for information. • The best-known examples of information retrieval systems are search engines on the World Wide Web. • A Web user can type a query such as [AI book] into a search engine and see a list of relevant pages. • In this section, we will see how such systems are built.
  • 3. An IR system can be characterized by 1. A corpus of documents. Each system must decide what it wants to treat as a document: a paragraph, a page, or a multipage text. 2. Queries posed in a query language. A query specifies what the user wants to know. 3. A result set. This is the subset of documents that the IR system judges to be relevant to the query. 4. A presentation of the result set. This can be as simple as a ranked list of document titles. The earliest IR systems worked on a Boolean keyword model.
  • 4. Cont., • This model has the advantage of being • simple to explain and implement. • Disadvantage: • First, the degree of relevance of a document is a single bit, so there is no guidance as to how to order the relevant documents for presentation. • Second, Boolean expressions are unfamiliar to users who are not programmers or logicians. • Third, it can be hard to formulate an appropriate query, even for a skilled user.
  • 5. IR scoring functions • Most IR systems have abandoned the Boolean model and use models based on the statistics of word counts. • We describe the BM25 scoring function, which comes from the Okapi project of Stephen Robertson and Karen Sparck Jones at London’s City College, and has been used in search engines such as the open-source Lucene project. • A scoring function takes a document and a query and returns a numeric score; the most relevant documents have the highest scores. • In the BM25 function, the score is a linear weighted combination of scores for each of the words that make up the query.
  • 6. Three factors affect the weight of a query term: • First, the frequency with which a query term appears in a document: • For the query [farming in Kansas], documents that mention “farming” frequently will have higher scores. • Second, the inverse document frequency of the term, or IDF. • The word “in” appears in almost every document, so it has a high document frequency, and thus a low inverse document frequency, and thus it is not as important to the query as “farming” or “Kansas.” • Third, the length of the document. • A million-word document will probably mention all the query words, but may not actually be about the query. A short document that mentions all the words is a much better candidate.
  • 7. Cont., The BM25 function takes all three of these into account. Then, given a document dj and a query consisting of the words q1:N, we have TF(qi, dj ), the count of the number of times word qi appears in document dj . DF(qi), that gives the number of documents that contain the word qi. We have two parameters, k and b, that can be tuned by cross-validation; typical values are k = 2.0 and b = 0.75.
  • 8. Cont., • L is the average document length • |dj | is the length of document dj in words • IDF(qi) is the inverse document frequency of word qi, given by
  • 9. IR system evaluation • How do we know whether an IR system is performing well? • We undertake an experiment in which the system is given a set of queries and the result sets are scored with respect to human relevance judgments. • Traditionally, there have been two measures used in the scoring: • recall • precision.
  • 10. Cont., • For example, Imagine that an IR system has returned a result set for a single query, for which we know which documents are and are not relevant, out of a corpus of 100 documents. The document counts in each category are given in the following table:
  • 11. Cont., • Precision measures the proportion of documents in the result set that are actually relevant. • In our example, the precision is 30/(30 + 10)=.75. • The false positive rate is 1 − .75=.25. • Recall measures the proportion of all the relevant documents in the collection that are in the result set. • In our example, recall is 30/(30 + 20)=.60. • The false negative rate is 1 − .60=.40.
  • 12. IR refinements • There are many possible refinements to the system described here, and indeed Web search engines are continually updating their algorithms as they discover new approaches and as the Web grows and changes. • One common refinement is a better model of the effect of document length on relevance.
  • 13. The PageRank algorithm • PageRank was one of the two original ideas that set Google’s search apart from other Web search engines when it was introduced in 1997. • PageRank was invented to solve the problem of the tyranny of TF scores: if the query is [IBM], how do we make sure that IBM’s home page, ibm.com, is the first result, even if another page mentions the term “IBM” more frequently? • The idea is that ibm.com has many in-links (links to the page), so it should be ranked higher: • the PageRank algorithm is designed to weight links from high-quality sites more heavily. What is a highquality site? One that is linked to by other high-quality sites.
  • 14. The HITS algorithm • The Hyperlink-Induced Topic Search algorithm, also known as “Hubs and Authorities” or HITS, is another influential link-analysis algorithm. • HITS differs from PageRank in several ways. • First, it is a query-dependent measure: it rates pages with respect to a query. • Given a query, HITS first finds a set of pages that are relevant to the query.
  • 15. Cont., • It does that by intersecting hit lists of query words, and then adding pages in the link neighborhood of these pages—pages that link to or are linked from one of the pages in the original relevant set. • Each page in this set is considered an authority on the query to the degree that other HUB pages in the relevant set point to it. • A page is considered a hub to the degree that it points to other authoritative pages in the relevant set.
  • 16. Question answering • Information retrieval is the task of finding documents that are relevant to a query, where the query may be a question, or just a topic area or concept. • Question answering is a somewhat different task, in which the query really is a question, and the answer is not a ranked list of documents but rather a short response—a sentence, or even just a phrase. • There have been question-answering NLP (natural language processing) systems since the 1960s, but only since 2001 have such systems used Web information retrieval to radically increase their breadth of coverage.