VSM: Vector Space Model
 


Information Retrieval Systems lecture notes, Prof. Kang Seung-sik (강승식)

    VSM (Vector Space Model) Presentation Transcript

    • Chapter 2 Modeling
    • Contents
      • Introduction
      • Taxonomy of IR Models
      • Retrieval: Ad hoc, Filtering
      • Formal Characterization of IR Models
      • Classic IR Models
      • Alternative Set Theoretic Models
      • Alternative Algebraic Models
      • Alternative Probabilistic Models
    • Contents (Cont.)
      • Structured Text Retrieval Models
      • Models for Browsing
      • Trends and Research Issues
    • 2.1 Introduction
      • Traditional IR System
        • Adopt index terms to index and retrieve documents
      • Index Term
        • Restricted sense
          • Keyword which has some meaning of its own (usually noun)
        • General form
          • Any word which appears in the text of a document
      • Ranking Algorithm
        • Attempt to establish a simple ordering of the documents retrieved
        • Operate according to basic premises regarding the notion of document relevance
    • 2.2 A Taxonomy of IR Models
      • User Task
        • Retrieval: Ad hoc, Filtering
          • Classic Models: Boolean, Vector, Probabilistic
          • Structured Models: Non-Overlapping Lists, Proximal Nodes
        • Browsing: Flat, Structure Guided, Hypertext
      • Set Theoretic: Fuzzy, Extended Boolean
      • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
      • Probabilistic: Inference Network, Belief Network
    • A Taxonomy of IR Models (Cont.)
      • Retrieval models
        • Most frequently associated with distinct combinations of a document logical view and a user task
      • Logical view of documents by user task
        • Retrieval
          • Index Terms: Classic, Set theoretic, Algebraic, Probabilistic
          • Full Text: Classic, Set theoretic, Algebraic, Probabilistic
          • Full Text + Structure: Structured
        • Browsing
          • Index Terms: Flat
          • Full Text: Flat, Hypertext
          • Full Text + Structure: Structure Guided, Hypertext
    • 2.3 Retrieval
      • Ad hoc
        • The documents in the collection remain relatively static while new queries are submitted to the system
        • The most common form of user task
      • Filtering
        • The queries remain relatively static while new documents come into the system (and leave)
        • User profile
          • Describing the user’s preferences
        • Routing: a variation of filtering in which the filtered documents are ranked
    • 2.4 A Formal Characterization of IR Models
      • IR Model
    • 2.5 Classic Information Retrieval
      • Boolean Model
        • Based on set theory and Boolean algebra
        • Queries are specified as Boolean expressions
        • Model considers that index terms are present or absent in a document
      • Vector Model
        • Partial matching is possible
        • Assign non-binary weights to index terms
        • Term weights are used to compute the degree of similarity
      • Probabilistic Model
        • Given a query $q$, the model assigns to each document $d_j$, as a measure of its similarity to the query, the ratio $P(d_j \text{ relevant to } q) / P(d_j \text{ non-relevant to } q)$, which is the odds of $d_j$ being relevant to $q$
    • 2.5.1 Basic Concepts
      • Index Term
        • Word whose semantics helps in remembering the document’s main themes
        • Mainly nouns
          • Nouns have meaning by themselves
        • Weights
          • Not all terms are equally useful for describing the document
        • Definition
    • Basic Concepts (Cont.)
      • Mutual Independence
        • Index term weights are usually assumed to be mutually independent
        • Knowing the weight $w_{ij}$ associated with the pair $(k_i, d_j)$ tells us nothing about the weight $w_{(i+1)j}$ associated with the pair $(k_{i+1}, d_j)$
        • The assumption is a simplification, but it eases the computation of index term weights and allows fast ranking computation
    • 2.5.2 Boolean Model
      • Base
        • Simple retrieval model based on Set theory and Boolean algebra
        • Operations: and, or, not
      • Advantage
        • Clean formalism
        • Boolean query expressions have precise semantics
      • Disadvantage
        • Binary decision (no notion of a partial match)
          • Retrieval of too few or too many documents
        • Users find it difficult to express their query requests as Boolean expressions
    • Boolean Model (Cont.)
      • Definition
      • Example
        [Figure: diagram over the index terms $k_a$, $k_b$, $k_c$]
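    • Boolean Model (Cont.)
      • Terms: parallel (병렬), program (프로그램), system (시스템)
      • Index terms and similarity per document
        Document  parallel  program  system  …  Similarity
        001       1         0        1       …  1
        002       0         0        1       …  0
        003       0         1        1       …  0
        004       1         1        0       …  1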
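
As a rough illustration of the Boolean model, the sketch below evaluates a Boolean query against the term incidence values of the example above (병렬 = parallel, 프로그램 = program, 시스템 = system). The query `parallel AND (program OR system)` is an assumption: the transcript keeps the three terms but not the operators, and this is only one expression consistent with the listed similarities. The helper name `boolean_similarity` is hypothetical.

```python
# Binary term incidence per document, taken from the example
# (1 = index term present, 0 = absent).
incidence = {
    "001": {"parallel": 1, "program": 0, "system": 1},
    "002": {"parallel": 0, "program": 0, "system": 1},
    "003": {"parallel": 0, "program": 1, "system": 1},
    "004": {"parallel": 1, "program": 1, "system": 0},
}


def boolean_similarity(terms):
    """Return 1 if the document satisfies the query, else 0.

    Assumed query (not given in the transcript):
    parallel AND (program OR system).
    """
    matches = terms["parallel"] and (terms["program"] or terms["system"])
    return 1 if matches else 0


# The Boolean model makes a binary decision: a document either matches or not.
similarities = {doc: boolean_similarity(t) for doc, t in incidence.items()}
retrieved = [doc for doc, sim in similarities.items() if sim == 1]

print(similarities)
print(retrieved)
```

Under this assumed query, documents 001 and 004 are retrieved and the other two are rejected outright, which shows the lack of partial matching noted among the disadvantages.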
    • 2.5.3 Vector model
      • Motivation
        • Binary weights are too limiting
          • Assign non-binary weights to index terms
        • A framework in which partial matching is possible
          • Instead of attempting to predict whether a document is relevant or not
          • Rank the documents according to their degree of similarity to the query
    • Vector model (Cont.)
      • Definition
    • Vector model (Cont.)
      • Clustering Problem
        • Intra-cluster similarity
          • What are the features which better describe the objects
        • Inter-cluster dissimilarity
          • What are the features which better distinguish the objects
      • IR Problem
        • Intra-cluster similarity (tf factor)
          • Raw frequency of a term $k_i$ inside a document $d_j$
        • Inter-cluster dissimilarity (idf factor)
          • Inverse of the frequency of a term $k_i$ among the documents in the collection
    • Vector model (Cont.)
      • Weighting Scheme
        • Term Frequency (tf)
          • Measure of how well that term describes the document contents
        • Inverse Document Frequency (idf)
          • Terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one
    • Vector model (Cont.)
      • Best known index term weighting scheme
        • Balance tf and idf (tf-idf scheme)
      • Query term weighting scheme
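
The transcript does not spell out the weighting formulas. One widely cited way to write the tf-idf balance, given here as an assumption rather than as the slide's own definition, is shown below. The symbols are introduced for illustration: $N$ is the number of documents in the collection, $n_i$ the number of documents containing term $k_i$, and $freq_{i,j}$ the raw frequency of $k_i$ in document $d_j$ (likewise $freq_{i,q}$ for the query $q$).

```latex
% Document term weight: normalized tf times idf
f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}, \qquad
idf_i = \log \frac{N}{n_i}, \qquad
w_{i,j} = f_{i,j} \times idf_i

% Query term weight: smoothed tf times idf
w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}
```

Variants differ mainly in how the tf factor is normalized; some use the raw frequency directly.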
    • Vector model (Cont.)
      • idf of each term
        Term  a     arrived  damaged  delivery  fire  gold  in    of    silver  shipment  truck
        idf   0     .176     .477     .477      .477  .176  0     0     .477    .176      .176
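
The idf values above can be reproduced with a short sketch. Two things here are assumptions and do not appear in the transcript: the three document texts (chosen because they yield exactly the listed values) and the base-10 logarithm in $idf_i = \log_{10}(N / n_i)$. The function name `inverse_document_frequency` is hypothetical.

```python
import math

# Assumed document texts (NOT in the transcript); chosen because they
# reproduce the idf values listed in the example.
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}


def inverse_document_frequency(documents):
    """Return idf_i = log10(N / n_i) for every term in the collection."""
    n_docs = len(documents)
    doc_freq = {}
    for text in documents.values():
        # Use a set so each term is counted once per document.
        for term in set(text.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log10(n_docs / df) for term, df in doc_freq.items()}


idf = inverse_document_frequency(docs)
for term in sorted(idf):
    print(f"{term:10s} {idf[term]:.3f}")
```

Terms that occur in all three documents get an idf of 0, so they contribute nothing to distinguishing one document from another.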
    • Vector model (Cont.)
      • Document vectors (not normalized)
            t1    t2    t3    t4    t5    t6    t7    t8    t9    t10   t11
        D1  0     0     .477  0     .477  .176  0     0     0     .176  0
        D2  0     .176  0     .477  0     0     0     0     .954  0     .176
        D3  0     .176  0     0     0     .176  0     0     0     .176  .176
        Q   0     0     0     0     0     .176  0     0     .477  0     .176
      • Hence, the ranking would be D2, D3, D1
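
The stated ranking can be checked with a minimal sketch that scores each document vector against the query vector by cosine similarity, the ranking formula these notes name for the vector model. The vectors are copied from the example in term order t1 to t11; the function name `cosine` and the zero-vector guard are illustrative choices, not part of the notes.

```python
import math

# Weight vectors over terms t1..t11, copied from the example (not normalized).
vectors = {
    "D1": [0, 0, .477, 0, .477, .176, 0, 0, 0, .176, 0],
    "D2": [0, .176, 0, .477, 0, 0, 0, 0, .954, 0, .176],
    "D3": [0, .176, 0, 0, 0, .176, 0, 0, 0, .176, .176],
}
query = [0, 0, 0, 0, 0, .176, 0, 0, .477, 0, .176]


def cosine(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_u * norm_v)


scores = {name: cosine(vec, query) for name, vec in vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name} {scores[name]:.3f}")
```

Sorting by descending score gives D2, D3, D1, matching the ranking above. D1 still receives a small nonzero score because it shares one query term, which is the partial matching the Boolean model cannot express.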
    • Vector model (Cont.)
      • Advantage
        • Term-weighting scheme improves retrieval performance
        • Partial matching strategy allows retrieval of documents that approximate the query conditions
        • Cosine ranking formula sorts the documents according to their degree of similarity to the query
      • Disadvantage
        • Index terms are assumed to be mutually independent
          • tf-idf scheme does not account for index term dependencies
          • However, in practice, consideration of term dependencies might be a disadvantage