VSM: Vector Space Model

Information Retrieval Systems lecture notes, Prof. Kang Seung-sik (강승식)

  • 1. Chapter 2 Modeling
  • 2. Contents
    • Introduction
    • Taxonomy of IR Models
    • Retrieval : Ad hoc, Filtering
    • Formal Characterization of IR Models
    • Classic IR Models
    • Alternative Set Theoretic Models
    • Alternative Algebraic Models
    • Alternative Probabilistic Models
  • 3. Contents (Cont.)
    • Structured Text Retrieval Models
    • Models for Browsing
    • Trends and Research Issues
  • 4. 2.1 Introduction
    • Traditional IR System
      • Adopts index terms to index and retrieve documents
    • Index Term
      • Restricted sense
        • Keyword which has some meaning of its own (usually a noun)
      • General form
        • Any word which appears in the text of a document
    • Ranking Algorithm
      • Attempts to establish a simple ordering of the retrieved documents
      • Operates according to basic premises regarding the notion of document relevance
  • 5. 2.2 A Taxonomy of IR Models
    • User Task
      • Retrieval: Ad hoc, Filtering
      • Browsing
    • Classic Models
      • Boolean
      • Vector
      • Probabilistic
    • Set Theoretic
      • Fuzzy
      • Extended Boolean
    • Algebraic
      • Generalized Vector
      • Latent Semantic Indexing
      • Neural Networks
    • Probabilistic
      • Inference Network
      • Belief Network
    • Structured Models
      • Non-Overlapping Lists
      • Proximal Nodes
    • Browsing
      • Flat
      • Structure Guided
      • Hypertext
  • 6. A Taxonomy of IR Models (Cont.)
    • Retrieval models
      • Most frequently associated with distinct combinations of a document logical view and a user task
    • Logical view of documents versus user task
      • Retrieval
        • Index Terms: Classic, Set theoretic, Algebraic, Probabilistic
        • Full Text: Classic, Set theoretic, Algebraic, Probabilistic
        • Full Text + Structure: Structured
      • Browsing
        • Index Terms: Flat
        • Full Text: Flat, Hypertext
        • Full Text + Structure: Structure Guided, Hypertext
  • 7. 2.3 Retrieval
    • Ad hoc
      • The documents in the collection remain relatively static while new queries are submitted to the system
      • The most common form of user task
    • Filtering
      • The queries remain relatively static while new documents come into the system (and leave)
      • User profile
        • Describing the user’s preferences
      • Routing: a variation of filtering that also ranks the filtered documents
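
The difference between filtering and routing can be sketched in a few lines of Python. This is only an illustration: the names `score`, `filter_stream`, and `route_stream`, the term-overlap score, and the sample documents are all assumptions made for the example, not part of the lecture notes.

```python
def score(profile, text):
    """Number of profile terms that occur in the document (a deliberately crude signal)."""
    return len(profile & set(text.lower().split()))

def filter_stream(profile, stream, threshold=1):
    """Filtering: keep each incoming document whose score reaches the threshold."""
    return [doc for doc in stream if score(profile, doc) >= threshold]

def route_stream(profile, stream, threshold=1):
    """Routing: same selection as filtering, but the kept documents are ranked by score."""
    kept = filter_stream(profile, stream, threshold)
    return sorted(kept, key=lambda doc: score(profile, doc), reverse=True)

# Hypothetical data: the profile stays fixed while documents keep arriving.
profile = {"vector", "retrieval"}
incoming = [
    "Boolean algebra basics",
    "Retrieval evaluation",
    "Vector retrieval models",
]

filtered = filter_stream(profile, incoming)  # arrival order is preserved
routed = route_stream(profile, incoming)     # best match first
```

In the ad hoc task the roles are reversed: the collection would be the fixed argument and the queries would vary.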
  • 8. 2.4 A Formal Characterization of IR Models
    • IR Model
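
The formula on this slide did not survive in the transcript. For reference, the textbook formulation that this chapter appears to follow (an assumption, not something recovered from the slide) characterizes an IR model as a quadruple:

```latex
\[
  \bigl[\, D,\; Q,\; \mathcal{F},\; R(q_i, d_j) \,\bigr]
\]
```

Here $D$ is the set of logical views of the documents, $Q$ is the set of logical views of the user queries, $\mathcal{F}$ is a framework for modeling documents, queries, and their relationships, and $R(q_i, d_j)$ is a ranking function that assigns a real number to a query $q_i \in Q$ and a document $d_j \in D$.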
  • 9. 2.5 Classic Information Retrieval
    • Boolean Model
      • Based on set theory and Boolean algebra
      • Queries are specified as Boolean expressions
      • Model considers that index terms are present or absent in a document
    • Vector Model
      • Partial matching is possible
      • Assign non-binary weights to index terms
      • Term weights are used to compute the degree of similarity
    • Probabilistic Model
      • Given a query $q$, the model assigns each document $d_j$, as a measure of its similarity to the query, the ratio $P(d_j \text{ relevant to } q) / P(d_j \text{ non-relevant to } q)$, which is the odds of $d_j$ being relevant to $q$
  • 10. 2.5.1 Basic Concepts
    • Index Term
      • Word whose semantics helps in remembering the document’s main themes
      • Mainly nouns
        • Nouns have meaning by themselves
      • Weights
        • Not all terms are equally useful for describing the document
      • Definition
  • 11. Basic Concepts (Cont.)
    • Mutual Independence
      • Index term weights are usually assumed to be mutually independent
      • Knowing the weight $w_{ij}$ associated with the pair $(k_i, d_j)$ tells us nothing about the weight $w_{(i+1)j}$ associated with the pair $(k_{i+1}, d_j)$
      • This assumption simplifies the computation of index term weights and allows fast ranking
  • 12. 2.5.2 Boolean Model
    • Base
      • Simple retrieval model based on Set theory and Boolean algebra
      • Operation : and, or, not
    • Advantage
      • Clean formalism
      • Boolean query expressions have precise semantics
    • Disadvantage
      • Binary decision (no notion of a partial match)
        • Retrieval of too few or too many documents
      • Users find it difficult to express their query requests as Boolean expressions
  • 13. Boolean Model (Cont.)
    • Definition
    • Example
      • Figure: diagram over the index terms $k_a$, $k_b$, $k_c$
  • 14. Boolean Model (Cont.)
    • Binary index term weights and the resulting similarity for each document

      Document   parallel   program   system   …   Similarity
      001        1          0         1        …   1
      002        0          0         1        …   0
      003        0          1         1        …   0
      004        1          1         0        …   1
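
Because the Boolean model treats each index term as simply present or absent, a query reduces to set operations. The sketch below assumes a small hypothetical collection (`docs`) and two hypothetical queries; the helper names `having` and `boolean_sim` are illustrative only.

```python
# Hypothetical collection: each document maps to the set of index terms it contains.
docs = {
    "d1": {"parallel", "system"},
    "d2": {"system"},
    "d3": {"program", "system"},
    "d4": {"parallel", "program"},
}
all_docs = set(docs)

def having(term):
    """Set of documents that contain the given index term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# and -> intersection, or -> union, not -> complement with respect to the collection.
# Query 1: parallel AND (program OR system)
result_and_or = having("parallel") & (having("program") | having("system"))
# Query 2: parallel AND (program OR NOT system)
result_not = having("parallel") & (having("program") | (all_docs - having("system")))

def boolean_sim(result, doc_id):
    """Binary similarity: 1 if the document satisfies the query, otherwise 0."""
    return 1 if doc_id in result else 0

sims = {doc_id: boolean_sim(result_and_or, doc_id) for doc_id in sorted(docs)}
```

The similarity is always 0 or 1, which is exactly the "no partial match" limitation: a document matching two of three conjoined terms scores the same as one matching none.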
  • 15. 2.5.3 Vector model
    • Motivation
      • Binary weights are too limiting
        • Assign non-binary weights to index terms
      • A framework in which partial matching is possible
        • Instead of attempting to predict whether a document is relevant or not
        • Rank the documents according to their degree of similarity to the query
  • 16. Vector model (Cont.)
    • Definition
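
The definition itself is missing from the transcript. As a hedged reference, the usual vector model formulation represents a document $d_j$ and a query $q$ as vectors of $t$ non-negative term weights and measures their similarity by the cosine of the angle between them:

```latex
\[
  \vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj}), \qquad
  \vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})
\]
\[
  \mathrm{sim}(d_j, q)
  = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|}
  = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}
         {\sqrt{\sum_{i=1}^{t} w_{ij}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{iq}^{2}}}
\]
```

Since all weights are non-negative, the similarity lies between 0 and 1, and a document can score above 0 even when it matches only some of the query terms.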
  • 17. Vector model (Cont.)
    • Clustering Problem
      • Intra-cluster similarity
        • What are the features which better describe the objects
      • Inter-cluster similarity
        • What are the features which better distinguish the objects
    • IR Problem
      • Intra-cluster similarity (tf factor)
        • Raw frequency of a term $k_i$ inside a document $d_j$
      • Inter-cluster similarity (idf factor)
        • Inverse of the frequency of a term $k_i$ among the documents
  • 18. Vector model (Cont.)
    • Weighting Scheme
      • Term Frequency (tf)
        • Measure of how well that term describes the document contents
      • Inverse Document Frequency (idf)
        • Terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one
  • 19. Vector model (Cont.)
    • Best known index term weighting scheme
      • Balance tf and idf (tf-idf scheme)
    • Query term weighting scheme
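
The formulas for these two schemes are not in the transcript. The commonly cited textbook forms are given below as a reference, under the assumption that the slides follow them. Here $freq_{ij}$ is the raw frequency of term $k_i$ in document $d_j$, $N$ is the number of documents in the collection, and $n_i$ is the number of documents containing $k_i$.

```latex
\[
  f_{ij} = \frac{freq_{ij}}{\max_{l} freq_{lj}}, \qquad
  idf_i = \log \frac{N}{n_i}
\]
\[
  w_{ij} = f_{ij} \times \log \frac{N}{n_i}
\]
\[
  w_{iq} = \left( 0.5 + \frac{0.5\, freq_{iq}}{\max_{l} freq_{lq}} \right) \times \log \frac{N}{n_i}
\]
```

The first weight applies to document terms and the second to query terms.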
  • 20. Vector model (Cont.)

      Term       idf
      a          0
      arrived    .176
      damaged    .477
      delivery   .477
      fire       .477
      gold       .176
      in         0
      of         0
      silver     .477
      shipment   .176
      truck      .176
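
The idf column can be reproduced with a short script. The three document texts below are an assumption: the transcript does not include them, and they were chosen so that the document frequencies match the idf values on the slide. The base-10 logarithm is likewise inferred from those values.

```python
import math
from collections import Counter

# Assumed document texts (not present in the transcript).
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
N = len(docs)

# Document frequency: in how many documents each term occurs at least once.
df = Counter()
for text in docs.values():
    df.update(set(text.split()))

# idf_i = log10(N / n_i), rounded to three decimals as on the slide.
idf = {term: round(math.log10(N / n), 3) for term, n in sorted(df.items())}

for term, value in idf.items():
    print(f"{term:<10}{value}")
```

Terms that occur in all three documents (a, in, of) get an idf of 0 and so contribute nothing to the ranking.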
  • 21. Vector model (Cont.)
    • Document vectors (not normalized)

            t1    t2    t3    t4    t5    t6    t7    t8    t9    t10   t11
      D1    0     0     .477  0     .477  .176  0     0     0     .176  0
      D2    0     .176  0     .477  0     0     0     0     .954  0     .176
      D3    0     .176  0     0     0     .176  0     0     0     .176  .176
      Q     0     0     0     0     0     .176  0     0     .477  0     .176

    • Hence, the ranking would be D2, D3, D1
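
The stated ranking can be checked from the vectors on this slide. The sketch below lists each vector in the order t1 to t11 and scores every document against the query. It uses the plain dot product, on the assumption that "not normalized" means no length normalization, and also computes the cosine for comparison.

```python
import math

# Weights in the order t1 .. t11, taken from the slide.
Q = [0, 0, 0, 0, 0, 0.176, 0, 0, 0.477, 0, 0.176]
D = {
    "D1": [0, 0, 0.477, 0, 0.477, 0.176, 0, 0, 0, 0.176, 0],
    "D2": [0, 0.176, 0, 0.477, 0, 0, 0, 0, 0.954, 0, 0.176],
    "D3": [0, 0.176, 0, 0, 0, 0.176, 0, 0, 0, 0.176, 0.176],
}

def dot(u, v):
    """Inner product of two equally long weight vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

scores = {name: dot(Q, vec) for name, vec in D.items()}
cosines = {name: cosine(Q, vec) for name, vec in D.items()}

ranking = sorted(scores, key=scores.get, reverse=True)
cosine_ranking = sorted(cosines, key=cosines.get, reverse=True)
print(ranking)
```

Both scores give the same order for this example. D2 comes first because its silver weight (.954) multiplies the largest query weight (.477).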
  • 22. Vector model (Cont.)
    • Advantage
      • Term-weighting scheme improves retrieval performance
      • Partial matching strategy allows retrieval of documents that approximate the query conditions
      • Cosine ranking formula sorts the documents according to their degree of similarity to the query
    • Disadvantage
      • Index terms are assumed to be mutually independent
        • tf-idf scheme does not account for index term dependencies
        • However, in practice, consideration of term dependencies might be a disadvantage
