
VSM (Vector Space Model)


Information Retrieval Systems lecture notes, Prof. Kang Seung-Shik



1. Chapter 2 Modeling
2. Contents
   - Introduction
   - Taxonomy of IR Models
   - Retrieval: Ad hoc, Filtering
   - Formal Characterization of IR Models
   - Classic IR Models
   - Alternative Set Theoretic Models
   - Alternative Algebraic Models
   - Alternative Probabilistic Models
3. Contents (Cont.)
   - Structured Text Retrieval Models
   - Models for Browsing
   - Trends and Research Issues
4. 2.1 Introduction
   - Traditional IR systems
     - Adopt index terms to index and retrieve documents
   - Index term
     - Restricted sense: a keyword which has some meaning of its own (usually a noun)
     - General form: any word which appears in the text of a document
   - Ranking algorithm
     - Attempts to establish a simple ordering of the documents retrieved
     - Operates according to basic premises regarding the notion of document relevance
5. 2.2 A Taxonomy of IR Models
   - Classic models: Boolean, vector, probabilistic
     - Set theoretic refinements: Fuzzy, Extended Boolean
     - Algebraic refinements: Generalized Vector, Latent Semantic Indexing, Neural Networks
     - Probabilistic refinements: Inference Network, Belief Network
   - Structured models: Non-Overlapping Lists, Proximal Nodes
   - Browsing: Flat, Structure Guided, Hypertext
   - User task: Retrieval (Ad hoc, Filtering), Browsing
6. A Taxonomy of IR Models (Cont.)
   - Retrieval models are most frequently associated with distinct combinations of a document logical view and a user task:

   | User Task | Index Terms                                      | Full Text                                        | Full Text + Structure       |
   |-----------|--------------------------------------------------|--------------------------------------------------|-----------------------------|
   | Retrieval | Classic, Set theoretic, Algebraic, Probabilistic | Classic, Set theoretic, Algebraic, Probabilistic | Structured                  |
   | Browsing  | Flat                                             | Flat, Hypertext                                  | Structure Guided, Hypertext |
7. 2.3 Retrieval
   - Ad hoc
     - The documents in the collection remain relatively static while new queries are submitted to the system
     - The most common form of user task
   - Filtering
     - The queries remain relatively static while new documents come into the system (and leave)
     - User profile: describes the user's preferences
     - Routing: a variation of filtering which also ranks the filtered documents
8. 2.4 A Formal Characterization of IR Models
   - An IR model is a quadruple [D, Q, F, R(q_i, d_j)] where
     - D is a set of logical views (representations) of the documents in the collection
     - Q is a set of logical views (representations) of the user information needs, called queries
     - F is a framework for modeling document representations, queries, and their relationships
     - R(q_i, d_j) is a ranking function which associates a real number with a query q_i and a document representation d_j, defining an ordering among the documents with regard to the query
9. 2.5 Classic Information Retrieval
   - Boolean model
     - Based on set theory and Boolean algebra
     - Queries are specified as Boolean expressions
     - The model considers only whether index terms are present or absent in a document
   - Vector model
     - Partial matching is possible
     - Assigns non-binary weights to index terms
     - Term weights are used to compute the degree of similarity
   - Probabilistic model
     - Given a query q, the model assigns to each document d_j, as its measure of similarity to the query, the odds P(d_j relevant to q) / P(d_j non-relevant to q) of the document d_j being relevant to the query q
10. 2.5.1 Basic Concepts
   - Index term
     - A word whose semantics helps in remembering the document's main themes
     - Mainly nouns, since nouns have meaning by themselves
   - Weights
     - Not all terms are equally useful for describing the document contents
   - Definition
     - Let t be the number of index terms in the system and k_i a generic index term
     - A weight w_ij ≥ 0 is associated with each pair (k_i, d_j); w_ij = 0 if k_i does not appear in d_j
     - The document d_j is represented by the index term vector d_j = (w_1j, w_2j, ..., w_tj)
11. Basic Concepts (Cont.)
   - Mutual independence
     - Index term weights are usually assumed to be mutually independent
     - Knowing the weight w_ij associated with the pair (k_i, d_j) tells us nothing about the weight w_(i+1)j associated with the pair (k_(i+1), d_j)
     - This assumption simplifies the task of computing index term weights and allows for fast ranking computation
12. 2.5.2 Boolean Model
   - Basis
     - A simple retrieval model based on set theory and Boolean algebra
     - Operations: and, or, not
   - Advantages
     - Clean formalism
     - Boolean query expressions have precise semantics
   - Disadvantages
     - Binary decision with no notion of a partial match
       - Retrieves too few or too many documents
     - Users find it difficult to express their query requests in terms of Boolean expressions
13. Boolean Model (Cont.)
   - Definition
     - Index term weights are binary: w_ij ∈ {0, 1}
     - A query q is a Boolean expression which can be rewritten in disjunctive normal form (q_dnf)
     - sim(d_j, q) = 1 if some conjunctive component of q_dnf matches the term pattern of d_j, and 0 otherwise
   - Example
     - q = k_a ∧ (k_b ∨ ¬k_c), so q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), where each triple gives the binary weights of (k_a, k_b, k_c)
14. Boolean Model (Cont.)
   - Term-document incidence example, with index terms parallel (병렬), program (프로그램), and system (시스템); the similarity column (1 = retrieved) corresponds to the single-term query "parallel":

   | Document | parallel | program | system | … | Similarity |
   |----------|----------|---------|--------|---|------------|
   | 001      | 1        | 1       | 0      | … | 1          |
   | 002      | 0        | 1       | 1      | … | 0          |
   | 003      | 0        | 0       | 1      | … | 0          |
   | 004      | 1        | 0       | 1      | … | 1          |
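Boolean retrieval over the incidence table above can be sketched as follows (a minimal illustration, not from the slides; the function name `retrieve` and the lambda encoding of queries are assumptions for this sketch):

```python
# Binary term-document incidence, mirroring the table above
# (terms translated from the Korean slide: parallel, program, system).
incidence = {
    "001": {"parallel": 1, "program": 1, "system": 0},
    "002": {"parallel": 0, "program": 1, "system": 1},
    "003": {"parallel": 0, "program": 0, "system": 1},
    "004": {"parallel": 1, "program": 0, "system": 1},
}

def retrieve(query):
    """Return documents whose binary term pattern satisfies the Boolean query.

    There is no partial match: a document is either retrieved or not.
    """
    return sorted(doc for doc, terms in incidence.items() if query(terms))

# Single-term query k_parallel -- reproduces the similarity column above
print(retrieve(lambda t: t["parallel"]))  # ['001', '004']

# Compound query: parallel AND (program OR NOT system)
print(retrieve(lambda t: t["parallel"] and (t["program"] or not t["system"])))  # ['001']
```

Encoding queries as predicates over the incidence row keeps the and/or/not operators of the model as plain Python operators, which makes the all-or-nothing nature of the Boolean decision easy to see.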
15. 2.5.3 Vector Model
   - Motivation
     - Binary weights are too limiting
       - Assign non-binary weights to index terms instead
     - A framework in which partial matching is possible
       - Instead of attempting to predict whether a document is relevant or not, rank the documents according to their degree of similarity to the query
16. Vector Model (Cont.)
   - Definition
     - Non-binary weights w_ij ≥ 0 are associated with the pairs (k_i, d_j), and w_iq ≥ 0 with the pairs (k_i, q)
     - The document and the query are represented as t-dimensional vectors d_j = (w_1j, ..., w_tj) and q = (w_1q, ..., w_tq)
     - Similarity is the cosine of the angle between the two vectors:
       sim(d_j, q) = (d_j · q) / (|d_j| × |q|) = Σ_i (w_ij × w_iq) / (sqrt(Σ_i w_ij²) × sqrt(Σ_i w_iq²))
17. Vector Model (Cont.)
   - Clustering problem
     - Intra-cluster similarity: which features better describe the objects in the cluster?
     - Inter-cluster similarity: which features better distinguish the objects in the cluster from the rest?
   - IR problem
     - Intra-cluster similarity (tf factor): the raw frequency of a term k_i inside a document d_j
     - Inter-cluster similarity (idf factor): the inverse of the frequency of a term k_i among the documents of the collection
18. Vector Model (Cont.)
   - Weighting scheme
     - Term frequency (tf): a measure of how well the term describes the document contents
     - Inverse document frequency (idf): terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one
19. Vector Model (Cont.)
   - Best-known index term weighting scheme: balance tf and idf (tf-idf scheme)
     - tf_ij = freq_ij / max_l freq_lj (frequency of k_i in d_j, normalized by the most frequent term in d_j)
     - idf_i = log(N / n_i) (N: number of documents in the collection, n_i: number of documents containing k_i)
     - w_ij = tf_ij × idf_i
   - Query term weighting scheme (Salton and Buckley):
     - w_iq = (0.5 + 0.5 × freq_iq / max_l freq_lq) × log(N / n_i)
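The weighting formulas above can be sketched directly in code. This is a minimal illustration with assumed function names; the log base is taken as 10 to match the worked example on the next slides:

```python
import math

def tf(freq_ij, max_freq_j):
    """Normalized term frequency of k_i in document d_j."""
    return freq_ij / max_freq_j

def idf(N, n_i):
    """Inverse document frequency: log10(N / n_i)."""
    return math.log10(N / n_i)

def tfidf(freq_ij, max_freq_j, N, n_i):
    """Document term weight: w_ij = tf_ij * idf_i."""
    return tf(freq_ij, max_freq_j) * idf(N, n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    """Salton-Buckley query weight: (0.5 + 0.5 * tf_iq) * idf_i."""
    return (0.5 + 0.5 * tf(freq_iq, max_freq_q)) * idf(N, n_i)

# A term occurring in every document gets idf 0, so its weight vanishes:
print(idf(3, 3))          # 0.0
# A term with maximal frequency in d_j and n_i = 1 out of N = 3 documents:
print(tfidf(2, 2, 3, 1))  # log10(3) ≈ 0.477
```

Note that the example tables on the following slides use the raw frequency freq_ij (not the normalized tf_ij) multiplied by idf_i, which is why "silver", occurring twice in D2, receives weight 2 × .477 = .954 there.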
20. Vector Model (Cont.)
   - idf values for the example collection (idf_i = log10(N / n_i), N = 3):

   | Term     | idf  |
   |----------|------|
   | truck    | .176 |
   | shipment | .176 |
   | silver   | .477 |
   | of       | 0    |
   | in       | 0    |
   | gold     | .176 |
   | fire     | .477 |
   | delivery | .477 |
   | damaged  | .477 |
   | arrived  | .176 |
   | a        | 0    |
21. Vector Model (Cont.)
   - Document and query vectors (not normalized):

   | Vector | t1 | t2   | t3   | t4   | t5   | t6   | t7 | t8 | t9   | t10  | t11  |
   |--------|----|------|------|------|------|------|----|----|------|------|------|
   | D1     | 0  | 0    | .477 | 0    | .477 | .176 | 0  | 0  | 0    | .176 | 0    |
   | D2     | 0  | .176 | 0    | .477 | 0    | 0    | 0  | 0  | .954 | 0    | .176 |
   | D3     | 0  | .176 | 0    | 0    | 0    | .176 | 0  | 0  | 0    | .176 | .176 |
   | Q      | 0  | 0    | 0    | 0    | 0    | .176 | 0  | 0  | .477 | 0    | .176 |

   - Hence, the ranking would be D2, D3, D1
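The stated ranking can be reproduced by computing the cosine similarity between the query vector and each document vector from the table above (a minimal sketch; the variable names are assumptions):

```python
import math

# tf-idf vectors from the table above, components t1..t11, not normalized
vectors = {
    "D1": [0, 0, .477, 0, .477, .176, 0, 0, 0, .176, 0],
    "D2": [0, .176, 0, .477, 0, 0, 0, 0, .954, 0, .176],
    "D3": [0, .176, 0, 0, 0, .176, 0, 0, 0, .176, .176],
}
query = [0, 0, 0, 0, 0, .176, 0, 0, .477, 0, .176]

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

ranking = sorted(vectors, key=lambda d: cosine(vectors[d], query), reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```

Because the cosine normalizes by both vector lengths, the "not normalized" weights in the table pose no problem: D2 wins mainly through its .954 weight on t9, the high-idf query term it shares with the query.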
22. Vector Model (Cont.)
   - Advantages
     - The term-weighting scheme improves retrieval performance
     - The partial-matching strategy allows retrieval of documents that approximate the query conditions
     - The cosine ranking formula sorts the documents according to their degree of similarity to the query
   - Disadvantage
     - Index terms are assumed to be mutually independent
       - The tf-idf scheme does not account for index term dependencies
       - However, in practice, taking term dependencies into consideration might itself hurt performance