VSM (Vector Space Model)


Published from: Information Retrieval Systems lecture notes (Prof. Kang Seung-Shik)


  1. Chapter 2: Modeling
  2. Contents
     - Introduction
     - Taxonomy of IR Models
     - Retrieval: Ad hoc, Filtering
     - Formal Characterization of IR Models
     - Classic IR Models
     - Alternative Set Theoretic Models
     - Alternative Algebraic Models
     - Alternative Probabilistic Models
  3. Contents (Cont.)
     - Structured Text Retrieval Models
     - Models for Browsing
     - Trends and Research Issues
  4. 2.1 Introduction
     - Traditional IR systems
       - Adopt index terms to index and retrieve documents
     - Index term
       - Restricted sense: a keyword which has some meaning of its own (usually a noun)
       - General form: any word which appears in the text of a document
     - Ranking algorithm
       - Attempts to establish a simple ordering of the retrieved documents
       - Operates according to basic premises regarding the notion of document relevance
  5. 2.2 A Taxonomy of IR Models
     (Figure: taxonomy of IR models, organized by user task)
     - Retrieval (ad hoc, filtering)
       - Classic models: Boolean, vector, probabilistic
       - Set theoretic: fuzzy, extended Boolean
       - Algebraic: generalized vector, latent semantic indexing, neural networks
       - Probabilistic: inference network, belief network
       - Structured models: non-overlapping lists, proximal nodes
     - Browsing
       - Flat, structure guided, hypertext
  6. A Taxonomy of IR Models (Cont.)
     - Retrieval models are most frequently associated with distinct combinations of a document logical view and a user task:

       User Task | Index Terms                 | Full Text                   | Full Text + Structure
       ----------|-----------------------------|-----------------------------|-----------------------------
       Retrieval | Classic, Set theoretic,     | Classic, Set theoretic,     | Structured
                 | Algebraic, Probabilistic    | Algebraic, Probabilistic    |
       Browsing  | Flat                        | Flat, Hypertext             | Structure guided, Hypertext
  7. 2.3 Retrieval
     - Ad hoc
       - The documents in the collection remain relatively static while new queries are submitted to the system
       - The most common form of user task
     - Filtering
       - The queries remain relatively static while new documents come into (and leave) the system
       - User profile: describes the user's preferences
       - Routing: a variation of filtering which also ranks the filtered documents
  8. 2.4 A Formal Characterization of IR Models
     - An IR model is a quadruple [D, Q, F, R(q_i, d_j)] where
       - D is the set of logical views (representations) of the documents in the collection
       - Q is the set of logical views (representations) of the user information needs (queries)
       - F is a framework for modeling document representations, queries, and their relationships
       - R(q_i, d_j) is a ranking function which associates a real number with a query q_i and a document representation d_j
  9. 2.5 Classic Information Retrieval
     - Boolean model
       - Based on set theory and Boolean algebra
       - Queries are specified as Boolean expressions
       - The model considers index terms to be either present or absent in a document
     - Vector model
       - Partial matching is possible
       - Assigns non-binary weights to index terms
       - Term weights are used to compute the degree of similarity
     - Probabilistic model
       - Given a query q, the model assigns to each document d_j, as its measure of similarity to the query, the odds P(d_j relevant to q) / P(d_j non-relevant to q)
  10. 2.5.1 Basic Concepts
     - Index term
       - A word whose semantics helps in remembering the document's main themes
       - Mainly nouns, since nouns have meaning by themselves
     - Weights
       - Not all terms are equally useful for describing the document contents
       - Definition: a weight w_ij >= 0 is associated with each pair (k_i, d_j); w_ij = 0 when the term k_i does not appear in document d_j
  11. Basic Concepts (Cont.)
     - Mutual independence
       - Index term weights are usually assumed to be mutually independent
       - Knowing the weight w_ij associated with the pair (k_i, d_j) tells us nothing about the weight w_(i+1)j associated with the pair (k_(i+1), d_j)
       - This assumption simplifies the task of computing index term weights and allows fast ranking computation
  12. 2.5.2 Boolean Model
     - Basis
       - A simple retrieval model based on set theory and Boolean algebra
       - Operations: AND, OR, NOT
     - Advantages
       - Clean formalism
       - Boolean query expressions have precise semantics
     - Disadvantages
       - Binary decision (no notion of a partial match)
         - Tends to retrieve too few or too many documents
       - Users find it difficult to express their information needs as Boolean expressions
  13. Boolean Model (Cont.)
     - Definition
       - Each document is represented by a binary vector of index term weights (w_ij is 0 or 1)
       - sim(d_j, q) = 1 if some conjunctive component of the query's disjunctive normal form matches the document's term vector, and 0 otherwise
     - Example
       - q = k_a AND (k_b OR NOT k_c), whose disjunctive normal form over the weight vector (k_a, k_b, k_c) is (1,1,1) OR (1,1,0) OR (1,0,0)
  14. Boolean Model (Cont.)
     - Example: binary term-document table and Boolean similarity
       (index terms: 병렬 = parallel, 프로그램 = program, 시스템 = system)

       Document | parallel | program | system | ... | Similarity
       ---------|----------|---------|--------|-----|-----------
       001      | 1        | 1       | 0      | ... | 1
       002      | 0        | 1       | 1      | ... | 0
       003      | 0        | 0       | 1      | ... | 0
       004      | 1        | 0       | 1      | ... | 1
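The Boolean-model example above can be sketched in a few lines of Python. The incidence-table layout is a reconstruction from the slide, and the query shown is a hypothetical one chosen to be consistent with the slide's similarity column (1, 0, 0, 1); `sim` is an illustrative helper, not from the slides.

```python
# Boolean retrieval sketch over the slide's (reconstructed) incidence table:
# binary term vectors for documents 001-004 over the index terms
# parallel(병렬), program(프로그램), system(시스템).
incidence = {
    "001": {"parallel": 1, "program": 1, "system": 0},
    "002": {"parallel": 0, "program": 1, "system": 1},
    "003": {"parallel": 0, "program": 0, "system": 1},
    "004": {"parallel": 1, "program": 0, "system": 1},
}

def sim(doc, query):
    """Boolean similarity: 1 if the document satisfies the query, else 0."""
    return int(bool(query(doc)))

# Hypothetical query consistent with the similarity column above:
# q = parallel AND (program OR system)
q = lambda d: d["parallel"] and (d["program"] or d["system"])

print([sim(d, q) for d in incidence.values()])  # [1, 0, 0, 1]
```

Because the model is purely set-theoretic, every document scores exactly 0 or 1; there is no way to say that document 002 is "closer" to the query than 003.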
  15. 2.5.3 Vector Model
     - Motivation
       - Binary weights are too limiting
         - Assign non-binary weights to index terms instead
       - A framework in which partial matching is possible
         - Instead of attempting to predict whether a document is relevant or not, rank the documents according to their degree of similarity to the query
  16. Vector Model (Cont.)
     - Definition
       - Each document d_j and each query q is represented as a vector of index term weights
       - Similarity is the cosine of the angle between the two vectors:
         sim(d_j, q) = (d_j . q) / (|d_j| |q|)
  17. Vector Model (Cont.)
     - The IR problem viewed as a clustering problem
       - Intra-cluster similarity: which features better describe the objects in the cluster?
       - Inter-cluster similarity: which features better distinguish the objects in the cluster from the rest?
     - In IR terms
       - Intra-cluster similarity (tf factor): the raw frequency of a term k_i inside a document d_j
       - Inter-cluster similarity (idf factor): the inverse of the frequency of the term k_i among the documents in the collection
  18. Vector Model (Cont.)
     - Weighting scheme
       - Term frequency (tf): a measure of how well a term describes the document contents
       - Inverse document frequency (idf): terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one
  19. Vector Model (Cont.)
     - Best-known index term weighting scheme: balance tf and idf (the tf-idf scheme)
       - w_ij = f_ij x log(N / n_i), where f_ij is the (normalized) frequency of term k_i in document d_j, N is the total number of documents, and n_i is the number of documents containing k_i
     - Query term weighting scheme
       - w_iq = (0.5 + 0.5 x f_iq / max_l f_lq) x log(N / n_i)
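A minimal sketch of the tf-idf scheme described above, using raw term frequencies and base-10 logarithms so the weights match the idf values on the next slide. The `tfidf` helper and the three example sentences (which reproduce the slide's D1-D3) are illustrative assumptions, not from the slides.

```python
# Minimal tf-idf sketch: tf is the raw frequency of a term inside a
# document, idf = log10(N / n_i) where n_i is the number of documents
# containing term k_i.
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {term: weight} dicts."""
    n = len(docs)
    df = Counter()            # document frequency n_i for each term
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        tf = Counter(d)       # raw term frequency f_ij
        weights.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return weights

docs = [
    "shipment of gold damaged in a fire".split(),
    "delivery of silver arrived in a silver truck".split(),
    "shipment of gold arrived in a truck".split(),
]
w = tfidf(docs)
print(round(w[1]["silver"], 3))  # 0.954  (tf = 2, idf = log10(3/1))
print(round(w[0]["fire"], 3))    # 0.477  (tf = 1, idf = log10(3/1))
```

Note that terms appearing in all three documents ("of", "in", "a") get weight 0, since idf = log10(3/3) = 0: they carry no discriminating power.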
  20. Vector Model (Cont.)
     - Example idf values (N = 3 documents, idf = log10(N / n_i)):

       Term     | idf
       ---------|------
       a        | 0
       arrived  | .176
       damaged  | .477
       delivery | .477
       fire     | .477
       gold     | .176
       in       | 0
       of       | 0
       silver   | .477
       shipment | .176
       truck    | .176
  21. Vector Model (Cont.)
     - Document and query vectors (tf-idf weights, not normalized):

       Vector | t1 | t2   | t3   | t4   | t5   | t6   | t7 | t8 | t9   | t10  | t11
       -------|----|------|------|------|------|------|----|----|------|------|------
       D1     | 0  | 0    | .477 | 0    | .477 | .176 | 0  | 0  | 0    | .176 | 0
       D2     | 0  | .176 | 0    | .477 | 0    | 0    | 0  | 0  | .954 | 0    | .176
       D3     | 0  | .176 | 0    | 0    | 0    | .176 | 0  | 0  | 0    | .176 | .176
       Q      | 0  | 0    | 0    | 0    | 0    | .176 | 0  | 0  | .477 | 0    | .176

     - Hence, the ranking would be D2, D3, D1
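The ranking on this slide can be reproduced directly. Since the slide labels the vectors "not normalized", the sketch below scores by the plain inner product of each document vector with the query; `dot` is an illustrative helper, and the weights are copied from the table above.

```python
# Reproduce the slide's ranking: score D1-D3 against Q by the
# (unnormalized) inner product of their tf-idf vectors.
docs = {
    "D1": [0, 0, .477, 0, .477, .176, 0, 0, 0, .176, 0],
    "D2": [0, .176, 0, .477, 0, 0, 0, 0, .954, 0, .176],
    "D3": [0, .176, 0, 0, 0, .176, 0, 0, 0, .176, .176],
}
q = [0, 0, 0, 0, 0, .176, 0, 0, .477, 0, .176]

def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))

scores = {name: dot(vec, q) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```

D2 dominates because it shares the high-idf term "silver" with the query (and with tf = 2); dividing each score by the vector norms (the full cosine formula) would change the magnitudes but not this ordering.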
  22. Vector Model (Cont.)
     - Advantages
       - The term-weighting scheme improves retrieval performance
       - The partial matching strategy allows retrieval of documents that approximate the query conditions
       - The cosine ranking formula sorts the documents according to their degree of similarity to the query
     - Disadvantages
       - Index terms are assumed to be mutually independent
         - The tf-idf scheme does not account for index term dependencies
         - In practice, however, accounting for term dependencies can hurt performance, so the independence assumption is not necessarily a disadvantage