Model of information retrieval (3)


Published in: Technology

  2. 2. INFORMATION RETRIEVAL  Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.  Searches can be based on metadata or on full-text (or other content-based) indexing.  Goal: Find the documents most relevant to a certain Query  Dealing with notions of:  Collection of documents  Query (User’s information need)  Notion of Relevancy
  3. 3. MODEL  A model is a construct designed to help us understand a complex system  A particular way of “looking at things”  Models inevitably make simplifying assumptions  What are the limitations of the model?  Different types of models:  Conceptual models  Physical analog models  Mathematical models
  4. 4. Retrieval Models A retrieval model specifies the details of:  Document representation  Query representation  Retrieval function Determines a notion of relevance. Notion of relevance can be binary or continuous (i.e. ranked retrieval).
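The three components named on this slide can be sketched as a small Python structure. This is only an illustration: the names `RetrievalModel`, `tokenize`, `binary_score`, and `overlap_score` are hypothetical and do not come from the slides. It shows how swapping the retrieval function changes the notion of relevance from binary to continuous.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

Terms = FrozenSet[str]


@dataclass
class RetrievalModel:
    """Hypothetical container for the three parts of a retrieval model."""
    represent_document: Callable[[str], Terms]
    represent_query: Callable[[str], Terms]
    score: Callable[[Terms, Terms], float]  # the retrieval function


def tokenize(text: str) -> Terms:
    # Simplifying assumption: a text is just its set of lowercase words.
    return frozenset(text.lower().split())


def binary_score(query: Terms, doc: Terms) -> float:
    # Binary relevance: 1.0 only if every query term occurs in the document.
    return 1.0 if query <= doc else 0.0


def overlap_score(query: Terms, doc: Terms) -> float:
    # Continuous relevance: fraction of query terms found in the document.
    return len(query & doc) / len(query) if query else 0.0


binary_model = RetrievalModel(tokenize, tokenize, binary_score)
ranked_model = RetrievalModel(tokenize, tokenize, overlap_score)


def relevance(model: RetrievalModel, query_text: str, doc_text: str) -> float:
    query = model.represent_query(query_text)
    doc = model.represent_document(doc_text)
    return model.score(query, doc)
```

With the same representations, `binary_model` rejects a document that misses one query term, while `ranked_model` still gives it partial credit.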
  5. 5. CLASSES OF RM Boolean models (set theoretic)  Extended Boolean Vector space models (statistical/algebraic)  Generalized VS  Latent Semantic Indexing Probabilistic models
  6. 6. MODELS OF IR  Boolean model  Based on the notion of sets  Documents are retrieved only if they satisfy Boolean conditions specified in the query  Does not impose a ranking on retrieved documents  Exact match  Vector space model  Based on geometry, the notion of vectors in high dimensional space  Documents are ranked based on their similarity to the query (ranked retrieval)  Best/partial match
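Ranked retrieval in the vector space model can be illustrated with a minimal sketch. It assumes raw term counts as vector weights and cosine similarity as the similarity measure; the slides name neither choice, and the helper names and toy documents are made up for the example.

```python
import math
from collections import Counter


def term_vector(text):
    # Assumption: raw term frequency as the weight of each dimension.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Dot product over shared terms divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    # Score every document against the query, best match first.
    q = term_vector(query)
    scored = [(name, cosine_similarity(q, term_vector(text)))
              for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "rio brazil hotel",
    "d2": "hilo hawaii hotel",
    "d3": "boolean logic",
}
ranking = rank("rio hotel", docs)
```

Here `d2` matches only one of the two query terms, so it is a partial match that ranks below `d1` but above `d3`, which shares no terms with the query.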
  7. 7.  Language models  Based on the notion of probabilities and processes for generating text  Documents are ranked based on the probability that they generated the query  Best/partial match
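One common way to make "the probability that a document generated the query" concrete is a unigram language model per document. The sketch below assumes add-one (Laplace) smoothing and log probabilities, neither of which is specified on the slide; the function name and toy collection are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, vocab_size):
    """log P(query | document) under a unigram model with add-one smoothing."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values())
    # Each query term is treated as generated independently by the document.
    return sum(
        math.log((counts[term] + 1) / (total + vocab_size))
        for term in query.lower().split()
    )


docs = {
    "d1": "rio brazil hotel",
    "d2": "hilo hawaii hotel",
    "d3": "boolean logic model",
}
vocab = {term for text in docs.values() for term in text.lower().split()}

# Rank documents by how likely each is to have generated the query.
ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood("rio hotel", docs[name], len(vocab)),
    reverse=True,
)
```

Smoothing keeps a document that lacks a query term from receiving probability zero, which is what allows partial matches to be ranked at all.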
  8. 8. BOOLEAN MODEL  Based on Boolean logic, devised by George Boole (1815-1864)  He devised a system of symbolic logic in which he used three operators (+, ×, −) to combine statements in symbolic form.  These operators of Boolean logic are the logical sum (+), logical product (×), and logical difference (−); John Venn later illustrated them with diagrams.  IR systems allow users to express their queries using these operators.
  9. 9. BOOLEAN MODEL  Each index term is either present or absent  Documents are either Relevant or Not Relevant (no ranking)  A document is represented as a set of keywords.  Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope.  [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton  Output: Document is relevant or not. No partial matches or ranking.
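The example query on this slide can be evaluated with plain set operations over an inverted index. The following is a minimal sketch: the four toy documents are invented for illustration, and AND, OR, and NOT are mapped to set intersection, union, and difference.

```python
from collections import defaultdict

# Each document is represented as a set of keywords (toy data).
docs = {
    "d1": {"rio", "brazil", "hotel", "beach"},
    "d2": {"hilo", "hawaii", "hotel", "hilton"},
    "d3": {"hilo", "hawaii", "hotel"},
    "d4": {"rio", "brazil", "museum"},
}

# Inverted index: keyword -> set of documents that contain it.
index = defaultdict(set)
for name, terms in docs.items():
    for term in terms:
        index[term].add(name)

# [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
place = (index["rio"] & index["brazil"]) | (index["hilo"] & index["hawaii"])
result = (place & index["hotel"]) - index["hilton"]
```

The result is an unordered set: `d1` and `d3` satisfy the query, `d2` is excluded by the NOT, and `d4` fails the `hotel` condition. Nothing in the output says which match is better.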
  10. 10. BOOLEAN RETRIEVAL MODEL  Popular retrieval model because:  Easy to understand for simple queries.  Clean formalism.  Boolean models can be extended to include ranking.  Reasonably efficient implementations possible for normal queries.
  11. 11. BOOLEAN MODEL  Weights assigned to terms are either “0” or “1”  “0” represents “absence”: term isn’t in the document  “1” represents “presence”: term is in the document  Build queries by combining terms with Boolean operators  AND, OR, NOT  The system returns all documents that satisfy the query
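The 0/1 weights can be pictured as a term-document incidence matrix, with the Boolean operators applied bit by bit. This is a small sketch using made-up documents; the helper names `column`, `AND`, `OR`, and `NOT` are illustrative only.

```python
docs = {
    "d1": "rio brazil hotel",
    "d2": "hilo hawaii hotel hilton",
    "d3": "hilo hawaii hotel",
}

# Vocabulary in a fixed order so that every row has the same columns.
vocab = sorted({term for text in docs.values() for term in text.split()})

# incidence[doc][i] is 1 if vocab[i] is present in the document, else 0.
incidence = {
    name: [1 if term in text.split() else 0 for term in vocab]
    for name, text in docs.items()
}


def column(term):
    # One bit per document, in the same order as `docs`.
    i = vocab.index(term)
    return [row[i] for row in incidence.values()]


def AND(a, b):
    return [x & y for x, y in zip(a, b)]


def OR(a, b):
    return [x | y for x, y in zip(a, b)]


def NOT(a):
    return [1 - x for x in a]


# Query: hotel AND NOT hilton
hits = AND(column("hotel"), NOT(column("hilton")))
matching = [name for name, bit in zip(incidence, hits) if bit]
```

The system returns every document whose bit is 1 after the operators are applied, here `d1` and `d3`.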
  12. 12. AND/OR/NOT  [Figure: diagram of sets A, B, and C illustrating AND, OR, and NOT]
  13. 13. Why Boolean Retrieval Works  Boolean operators approximate natural language  Find documents about a good party that is not over  AND can discover relationships between concepts  good party  OR can discover alternate terminology  excellent party, wild party, etc.  NOT can discover alternate meanings  Democratic party
  14. 14. The Perfect Query Paradox  Every information need has a perfect set of documents  If not, there would be no sense doing retrieval  Every document set has a perfect query  AND every word in a document to get a query for it  Repeat for each document in the set  OR every document query to get the set query  But can users realistically be expected to formulate this perfect query?  Boolean query formulation is hard!
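The construction described on this slide is mechanical enough to sketch in code: AND together every word of each wanted document, then OR the per-document queries. The function names and the toy collection below are hypothetical.

```python
def perfect_query(target_texts):
    # One conjunct per wanted document: the set of all its words (ANDed).
    return [frozenset(text.lower().split()) for text in target_texts]


def to_string(query):
    # OR the per-document conjunctions into a single query string.
    return " OR ".join(
        "(" + " AND ".join(sorted(conjunct)) + ")" for conjunct in query
    )


def evaluate(query, collection):
    # A document matches if it contains every word of at least one conjunct.
    return sorted(
        name
        for name, text in collection.items()
        if any(conjunct <= set(text.lower().split()) for conjunct in query)
    )


collection = {
    "d1": "good party tonight",
    "d2": "democratic party convention",
    "d3": "party to a lawsuit",
}
wanted = ["d1", "d3"]
query = perfect_query(collection[name] for name in wanted)
retrieved = evaluate(query, collection)
```

The query retrieves exactly the wanted set here, but only because it was built from the documents themselves, which is the paradox: the user would need to know the answer in order to write the query. Note also that this sketch would over-retrieve if some unwanted document contained all the words of a wanted one.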
  15. 15. Why Boolean Retrieval Fails • Natural language is way more complex • AND “discovers” nonexistent relationships – Terms in different sentences, paragraphs, … • Guessing terminology for OR is hard – good, nice, excellent, outstanding, awesome, … • Guessing terms to exclude is even harder! – Democratic party, party to a lawsuit, …
  16. 16. BOOLEAN MODEL  Strengths  Precise, if you know the right strategies  Precise, if you have an idea of what you’re looking for  Efficient for the computer  Simple  Weaknesses  Users must learn Boolean logic  Boolean logic insufficient to capture the richness of language  No control over size of result set: either too many documents or none  When do you stop reading? All documents in the result set are considered “equally good”  What about partial matches? Documents that “don’t quite match” the query may be useful also  No notion of ranking (exact matching only)  All index terms have equal weight
  17. 17. PROBLEMS  Very rigid: AND means all; OR means any.  Difficult to express complex user requests.  Difficult to control the number of documents retrieved.  All matched documents will be returned.  Difficult to rank output.  All matched documents logically satisfy the query.  Difficult to perform relevance feedback.  If a document is identified by the user as relevant or irrelevant, how should the query be modified?
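The rigidity of AND and OR, and the resulting lack of control over result size, shows up even on a tiny collection. In this sketch the documents and the helper names `search_all` and `search_any` are invented for illustration.

```python
collection = {
    "d1": "good party tonight",
    "d2": "excellent party music",
    "d3": "wild party downtown",
    "d4": "democratic party convention",
}


def search_all(terms, collection):
    # AND: every query term must be present in the document.
    return sorted(
        name for name, text in collection.items()
        if set(terms) <= set(text.split())
    )


def search_any(terms, collection):
    # OR: at least one query term must be present in the document.
    return sorted(
        name for name, text in collection.items()
        if set(terms) & set(text.split())
    )


terms = ["good", "excellent", "party"]
narrow = search_all(terms, collection)  # no document has all three terms
broad = search_any(terms, collection)   # every document mentions "party"
```

The same three terms return nothing under AND and the whole collection under OR, including the off-topic `d4`, and neither result set is ranked.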
  18. 18. ADVANTAGES & DISADVANTAGES  Advantages  Results are predictable, relatively easy to explain  Many different features can be incorporated  Efficient processing since many documents can be eliminated from search  Disadvantages  Effectiveness depends entirely on user  Simple queries usually don’t work well  Complex queries are difficult.
  19. 19. LIMITATIONS  The first limitation relates to the formulation of search statements.  It has been noted that users are often unable to formulate an exact search statement by combining the AND, OR and NOT operators, especially when several query terms are involved.  In such cases the search statement becomes either too narrow or too broad.  The second limitation relates to the number of retrieved items.  It has been noted that users cannot predict a priori exactly how many items will be retrieved to satisfy a given query.  If the search statement is broad, the number of retrieved items may run to several hundred, and it may then be quite difficult to find the exact information required.  The third limitation is that the model identifies an item as relevant solely by checking whether a given query term is present or absent in a given record in the database.