similarity measure
A similarity measure can represent the similarity between two documents, two queries, or one document and one query

Presentation Transcript

  • Chapter 3: Similarity Measures. Data Mining Technology
  • Chapter 3: Similarity Measures. Written by Kevin E. Heinrich. Presented by Zhao Xinyou [email_address], 2007.6.7. Some materials (examples) are taken from websites.
  • Searching Process: Input Text → Index → Query → Sorting → Show Result (flow diagram: the user inputs key words, the system searches the index, and a ranked list of results is shown)
  • Example
  • Similarity Measures
    • A similarity measure can represent the similarity between two documents , two queries , or one document and one query
    • It is possible to rank the retrieved documents in the order of presumed importance
    • A similarity measure is a function which computes the degree of similarity between a pair of text objects
    • There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!)
    PP27-28
  • Classic Similarity Measures
    • All similarity measures should map to the range [-1, 1] or [0, 1]:
    • 0 (or -1) shows minimum similarity (completely dissimilar).
    • 1 shows maximum similarity (identical).
    PP28
  • Conversion
    • For example, suppose scores lie in [1, 10],
    • where 1 shows incompatible similarity and 10 shows absolute similarity.
    To map [1, 10] onto [0, 1]: s' = (s - 1) / 9. Generally, we may use the linear mapping s' = (s - min_s) / (max_s - min_s); non-linear mappings are also possible. PP28
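As a minimal sketch of the linear conversion above (the function name `rescale` is my own):

```python
def rescale(s, min_s, max_s):
    """Linearly map a similarity score s from [min_s, max_s] onto [0, 1]."""
    return (s - min_s) / (max_s - min_s)

# A score of 1 (incompatible) maps to 0; a score of 10 (absolute) maps to 1.
print(rescale(1, 1, 10))    # 0.0
print(rescale(10, 1, 10))   # 1.0
print(rescale(5.5, 1, 10))  # 0.5
```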
  • Vector-Space Model-VSM
    • In the 1960s, Salton and colleagues proposed the VSM, which was successfully applied in SMART (a text retrieval system).
    PP28-29
  • Example
    • D is a set of m Web documents: D = {d_1, d_2, …, d_i, …, d_m}, i = 1, 2, …, m
    • There are n distinct terms among the m documents; each document is a weight vector d_i = {w_i1, w_i2, …, w_ij, …, w_in}, j = 1, 2, …, n
    • A query is likewise a vector Q = {q_1, q_2, …, q_n}
    PP28-29 If similarity(q, d_i) > similarity(q, d_j), then d_i is more relevant than d_j.
  • Simple Measure Technology PP29
    Within the documents set, let A = the retrieved documents, B = the relevant documents, and A ∩ B = the retrieved and relevant documents.
    Precision = returned relevant documents / total returned documents: P(A, B) = |A ∩ B| / |A|
    Recall = returned relevant documents / total relevant documents: R(A, B) = |A ∩ B| / |B|
  • Example: Simple Measure Technology PP29
    Relevant: B = {A, B, C, D, E, F, G, H, I, J}, so |B| = 10
    Retrieved: A = {B, D, F, W, Y}, so |A| = 5
    Retrieved and relevant: A ∩ B = {B, D, F}, so |A ∩ B| = 3
    P = precision = 3/5 = 60%
    R = recall = 3/10 = 30%
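The example above can be reproduced directly with Python sets (the helper name `precision_recall` is my own):

```python
def precision_recall(retrieved, relevant):
    """P = |A ∩ B| / |A| and R = |A ∩ B| / |B| for retrieved set A, relevant set B."""
    hit = len(retrieved & relevant)  # retrieved AND relevant
    return hit / len(retrieved), hit / len(relevant)

relevant = set("ABCDEFGHIJ")            # |B| = 10
retrieved = {"B", "D", "F", "W", "Y"}   # |A| = 5
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.6 0.3
```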
  • Precision-Recall Graphs (Curves)
    • There is a tradeoff between precision and recall,
    • so precision is measured at different levels of recall.
    PP29-30 (Graphs: a curve for one query and curves for two queries; it is difficult to determine which of two hypothetical results is better.)
  • Similarity measures based on VSM
    • Dice coefficient
    • Overlap Coefficient
    • Jaccard
    • Cosine
    • Asymmetric
    • Dissimilarity
    • Other measures
    PP30
  • Dice Coefficient (cont'd)
    • Definition of the harmonic mean:
    • For x_1, x_2, …, x_n, the harmonic mean E equals n divided by (1/x_1 + 1/x_2 + … + 1/x_n), that is, E = n / (1/x_1 + 1/x_2 + … + 1/x_n)
    PP30
    • The harmonic mean E of precision P and recall R is E = 2PR / (P + R)
  • Dice Coefficient (cont'd)
    • Denotation of the Dice coefficient as a weighted harmonic mean of precision P and recall R: sim = 1 / (α(1/P) + (1 - α)(1/R))
    PP30 α > 0.5: precision is more important; α < 0.5: recall is more important. Usually α = 0.5.
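A short sketch of the weighted harmonic mean described above (function name is my own; with α = 0.5 it reduces to the familiar 2PR / (P + R)):

```python
def weighted_harmonic_mean(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r.
    alpha > 0.5 weights precision more heavily; alpha < 0.5 weights recall.
    With alpha = 0.5 this reduces to 2*p*r / (p + r)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

p, r = 0.6, 0.3  # precision and recall from the earlier example
print(weighted_harmonic_mean(p, r))       # 0.4  (= 2 * 0.6 * 0.3 / 0.9)
print(weighted_harmonic_mean(p, r, 0.8))  # 0.5  (precision weighted more heavily)
```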
  • Overlap Coefficient PP30-31: Overlap(A, B) = |A ∩ B| / min(|A|, |B|) (Venn diagram of queries A and documents B within the documents set)
  • Jaccard Coefficient (cont'd) PP31: J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|); for weighted term vectors, J(d, q) = (d · q) / (|d|² + |q|² - d · q) (Venn diagram of queries A and documents B within the documents set)
  • Example: Jaccard Coefficient
    • D1 = 2T1 + 3T2 + 5T3, i.e. (2, 3, 5)
    • D2 = 3T1 + 7T2 + 1T3, i.e. (3, 7, 1)
    • Q = 0T1 + 0T2 + 2T3, i.e. (0, 0, 2)
    • J(D1, Q) = 10 / (38 + 4 - 10) = 10/32 = 0.31
    • J(D2, Q) = 2 / (59 + 4 - 2) = 2/61 = 0.03
    PP31
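The worked Jaccard example can be checked with a few lines of Python (helper name is my own):

```python
def jaccard(d, q):
    """Generalized Jaccard coefficient for weighted term vectors:
    (d . q) / (|d|^2 + |q|^2 - d . q)."""
    dot = sum(x * y for x, y in zip(d, q))
    return dot / (sum(x * x for x in d) + sum(y * y for y in q) - dot)

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(jaccard(D1, Q), 2))  # 0.31  (= 10 / 32)
print(round(jaccard(D2, Q), 2))  # 0.03  (= 2 / 61)
```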
  • Cosine Coefficient (cont'd) PP31-32: C(d, q) = (d · q) / (|d| · |q|), the cosine of the angle between document vectors (d_11, d_12, …, d_1n), (d_21, d_22, …, d_2n) and the query vector (q_1, q_2, …, q_n)
  • Example: Cosine Coefficient
    • Q = 0T1 + 0T2 + 2T3, i.e. (0, 0, 2)
    • D1 = 2T1 + 3T2 + 5T3, i.e. (2, 3, 5)
    • D2 = 3T1 + 7T2 + 1T3, i.e. (3, 7, 1)
    C(D1, Q) = 10 / √(38 × 4) = 0.81
    C(D2, Q) = 2 / √(59 × 4) = 0.13
    PP31-32
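The same cosine computation in Python (helper name is my own):

```python
import math

def cosine(d, q):
    """Cosine coefficient: (d . q) / (|d| * |q|)."""
    dot = sum(x * y for x, y in zip(d, q))
    return dot / math.sqrt(sum(x * x for x in d) * sum(y * y for y in q))

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(cosine(D1, Q), 2))  # 0.81  (= 10 / sqrt(38 * 4))
print(round(cosine(D2, Q), 2))  # 0.13  (= 2 / sqrt(59 * 4))
```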
  • Asymmetric PP31 (figure: term weights w_ki of d_i mapped to w_kj of d_j; the measure is asymmetric, so sim(d_i, d_j) need not equal sim(d_j, d_i))
  • Euclidean distance PP32: dist(d, q) = √(Σ_k (d_k - q_k)²)
  • Manhattan (block) distance PP32: dist(d, q) = Σ_k |d_k - q_k|
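Both distances are easy to sketch (helper names are my own); note that unlike the coefficients above, these are dissimilarity measures, so smaller values mean more similar vectors:

```python
import math

def euclidean(d, q):
    """Euclidean distance: sqrt of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d, q)))

def manhattan(d, q):
    """Manhattan (block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(d, q))

D1, Q = (2, 3, 5), (0, 0, 2)
print(round(euclidean(D1, Q), 2))  # 4.69  (= sqrt(4 + 9 + 9))
print(manhattan(D1, Q))            # 8     (= 2 + 3 + 3)
```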
  • Other Measures
    • We may use prior/context knowledge.
    • For example:
    • Sim(q, d_j) = [content identifier similarity] + [objective term similarity] + [citation similarity]
    PP32
  • Comparison PP34
  • Comparison PP34
    Simple matching: |A ∩ B|
    Dice's coefficient: 2|A ∩ B| / (|A| + |B|)
    Cosine coefficient: |A ∩ B| / √(|A| × |B|)
    Overlap coefficient: |A ∩ B| / min(|A|, |B|)
    Jaccard's coefficient: |A ∩ B| / (|A| + |B| - |A ∩ B|)
    Since |A| + |B| - |A ∩ B| ≥ (|A| + |B|)/2 ≥ √(|A| × |B|) ≥ min(|A|, |B|) (using |A| ≥ |A ∩ B| and |B| ≥ |A ∩ B|), the denominators shrink from Jaccard down to overlap, so O ≥ C ≥ D ≥ J.
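A quick numerical check of the O ≥ C ≥ D ≥ J ordering, using the set forms of the four coefficients and the retrieved/relevant sets from the earlier precision/recall example (helper names are my own):

```python
import math

def overlap(a, b):     return len(a & b) / min(len(a), len(b))
def cosine_set(a, b):  return len(a & b) / math.sqrt(len(a) * len(b))
def dice(a, b):        return 2 * len(a & b) / (len(a) + len(b))
def jaccard_set(a, b): return len(a & b) / len(a | b)

A = {"B", "D", "F", "W", "Y"}  # retrieved
B = set("ABCDEFGHIJ")          # relevant
o, c, d, j = overlap(A, B), cosine_set(A, B), dice(A, B), jaccard_set(A, B)
print(round(o, 2), round(c, 2), round(d, 2), round(j, 2))  # 0.6 0.42 0.4 0.25
assert o >= c >= d >= j  # O >= C >= D >= J
```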
  • Example: Documents, Terms, Query (cont'd)
    • D1: A search engine for 3D models
    • D2: Design and implementation of a string database query language
    • D3: Ranking of documents by measures considering conceptual dependence between terms
    • D4: Exploiting hierarchical domain structure to compute similarity
    • D5: An approach for measuring semantic similarity between words using multiple information sources
    • D6: Determining semantic similarity among entity classes from different ontologies
    • D7: Strong similarity measures for ordered sets of documents in information retrieval
    Terms: T1: search(ing); T2: engine(s); T3: models; T4: database; T5: query; T6: language; T7: documents; T8: measur(es, ing); T9: conceptual; T10: dependence; T11: domain; T12: structure; T13: similarity; T14: semantic; T15: ontologies; T16: information; T17: retrieval
    Query: "Semantic similarity measures used by search engines and other information searching mechanisms" PP33
  • Example: Term-Document Matrix (cont'd) Matrix[q][A] PP34

          T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17
    D1     1  1  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
    D2     0  0  0  1  1  1  0  0  0  0   0   0   0   0   0   0   0
    D3     0  0  0  0  0  0  1  1  1  1   0   0   0   0   0   0   0
    D4     0  0  0  0  0  0  0  0  0  0   1   1   1   0   0   0   0
    D5     0  0  0  0  0  0  0  1  0  0   0   0   1   1   0   1   0
    D6     0  0  0  0  0  0  0  0  0  0   0   0   1   1   1   0   0
    D7     0  0  0  0  0  0  1  1  0  0   0   0   1   0   0   1   1
    Q      2  1  0  0  0  0  0  1  0  0   0   0   1   1   0   1   0
  • Dice coefficient PP30, PP34
  • Final Results PP34 O≥C≥D≥J
  • Current Applications
    • Multi-Dimensional Modeling
    • Hierarchical Clustering
    • Bioinformatics
    PP35-38
  • Discussion