# Similarity Measures

A similarity measure can represent the similarity between two documents, two queries, or one document and one query

## Similarity Measures: Presentation Transcript

• Chapter 3: Similarity Measures (Data Mining Technology)
• Chapter 3: Similarity Measures. Written by Kevin E. Heinrich; presented by Zhao Xinyou [email_address], 2007.6.7. Some materials (examples) are taken from websites.
• Searching process (flow diagram): input text → process → index → query → sorting → show text result. The user inputs key words and the search returns results.
• Example
• Similarity Measures
• A similarity measure can represent the similarity between two documents, two queries, or one document and one query
• It is possible to rank the retrieved documents in the order of presumed importance
• A similarity measure is a function which computes the degree of similarity between a pair of text objects
• A large number of similarity measures have been proposed in the literature, because the best similarity measure doesn't exist (yet!)
PP27-28
• Classic Similarity Measures
• All similarity measures should map to the range [-1, 1] or [0, 1]:
• 0 or -1 indicates minimum (incompatible) similarity.
• 1 indicates maximum (absolute) similarity.
PP28
• Conversion
• For example, suppose scores lie in [1, 10]:
• 1 indicates incompatible similarity; 10 indicates absolute similarity.
• To map [1, 10] onto [0, 1] linearly: s' = (s - 1) / 9. Generally, we may use s' = (s - min_s) / (max_s - min_s). Non-linear conversions are also possible. (A code sketch follows.) PP28
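As a minimal sketch of the linear conversion above (assuming scores on a [1, 10] scale), the following Python snippet maps a score onto [0, 1]:

```python
# Linear rescaling of a similarity score from [min_s, max_s] to [0, 1].
# The [1, 10] scale matches the example on the slide.

def rescale(s: float, min_s: float = 1.0, max_s: float = 10.0) -> float:
    """Linearly map a score s from [min_s, max_s] onto [0, 1]."""
    return (s - min_s) / (max_s - min_s)

if __name__ == "__main__":
    print(rescale(1))    # 0.0 -> incompatible similarity
    print(rescale(10))   # 1.0 -> absolute similarity
    print(rescale(5.5))  # 0.5
```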
• Vector-Space Model (VSM)
• In the 1960s, Salton and colleagues proposed the VSM, which was successfully applied in SMART (a text searching system).
PP28-29
• Example
• D is a set of m Web documents: D = {d_1, d_2, …, d_i, …, d_m}, i = 1, 2, …, m
• There are n terms across the m Web documents, and each document is represented as a weight vector: d_i = {w_i1, w_i2, …, w_ij, …, w_in}, i = 1, 2, …, m; j = 1, 2, …, n
• The query is represented the same way: Q = {q_1, q_2, …, q_i, …, q_n}, i = 1, 2, …, n
PP28-29. If similarity(q, d_i) > similarity(q, d_j), we may conclude that d_i is more relevant than d_j (a sketch follows).
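The following is a minimal sketch of this setup; the document weights and the plain dot-product similarity used here are illustrative placeholders, not the specific measures defined later in the chapter:

```python
# VSM sketch: each document d_i and the query q are n-dimensional
# term-weight vectors, and documents are ranked by similarity(q, d_i).
# Weights and the dot-product similarity are hypothetical placeholders.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# m = 3 documents over n = 4 terms (made-up weights)
D = {
    "d1": [2, 0, 1, 0],
    "d2": [0, 1, 0, 3],
    "d3": [1, 1, 1, 1],
}
q = [1, 0, 2, 0]

ranking = sorted(D, key=lambda name: dot(q, D[name]), reverse=True)
print(ranking)  # documents ordered from most to least relevant
```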
• Simple Measure Technology PP29. Over the documents set, A = retrieved documents, B = relevant documents, and A ∩ B = retrieved and relevant documents.
• Precision = returned relevant documents / total returned documents, i.e. P(A, B) = |A ∩ B| / |A|
• Recall = returned relevant documents / total relevant documents, i.e. R(A, B) = |A ∩ B| / |B|
• Example: Simple Measure Technology PP29. In the documents set, {A, C, E, G, H, I, J} are relevant only, {B, D, F} are retrieved and relevant, and {W, Y} are retrieved only. So |B| = |{relevant}| = |{A, B, C, D, E, F, G, H, I, J}| = 10; |A| = |{retrieved}| = |{B, D, F, W, Y}| = 5; |A ∩ B| = |{relevant} ∩ {retrieved}| = |{B, D, F}| = 3. P = precision = 3/5 = 60%; R = recall = 3/10 = 30% (a code sketch follows).
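A short Python sketch of the same precision/recall computation, using the sets from the example above (A = retrieved, B = relevant):

```python
# Precision and recall for the slide's example sets.

relevant  = {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"}   # B
retrieved = {"B", "D", "F", "W", "Y"}                            # A

both = retrieved & relevant              # A ∩ B = {B, D, F}
precision = len(both) / len(retrieved)   # |A∩B| / |A| = 3/5
recall    = len(both) / len(relevant)    # |A∩B| / |B| = 3/10

print(f"P = {precision:.0%}, R = {recall:.0%}")  # P = 60%, R = 30%
```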
• Precision-Recall Graph (Curves)
• There is a tradeoff between precision and recall,
• so precision is measured at different levels of recall (sketched below).
PP29-30. (Precision-recall curves: one query; two queries. It is difficult to determine which of these two hypothetical results is better.)
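As a sketch of how precision can be measured at different levels of recall for one query, the snippet below walks a ranked result list and records a (recall, precision) point whenever a relevant document appears; the ranking and the relevance judgments are hypothetical:

```python
# Trace points of a precision-recall curve for one query:
# record precision each time recall increases while walking the ranking.

ranked = ["d3", "d7", "d1", "d9", "d2", "d5", "d8", "d4"]  # hypothetical ranking
relevant = {"d3", "d1", "d2", "d4"}                        # hypothetical judgments

hits = 0
points = []  # (recall, precision) pairs
for rank, doc in enumerate(ranked, start=1):
    if doc in relevant:
        hits += 1
        points.append((hits / len(relevant), hits / rank))

for r, p in points:
    print(f"recall={r:.2f}  precision={p:.2f}")
```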
• Similarity measures based on VSM
• Dice coefficient
• Overlap Coefficient
• Jaccard
• Cosine
• Asymmetric
• Dissimilarity
• Other measures
PP30
• Dice Coefficient (cont'd)
• Definition of harmonic mean:
• For x_1, x_2, …, x_n, the harmonic mean is E = n / (1/x_1 + 1/x_2 + … + 1/x_n).
PP30
• For the harmonic mean (E) of precision (P) and recall (R), this gives E = 2PR / (P + R).
• Dice Coefficient (cont'd)
• Denotation of the Dice coefficient (formula on the slide), weighted by a parameter α:
PP30. α > 0.5: precision is more important; α < 0.5: recall is more important. Usually α = 0.5 (a sketch of this case follows).
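A sketch assuming the standard set-based Dice coefficient 2|A∩B| / (|A| + |B|) (the unweighted, α = 0.5 case); as the slide suggests, it coincides with the harmonic mean of precision and recall from the earlier example:

```python
# Dice coefficient on sets, and its relation to the harmonic mean of
# precision and recall (the α = 0.5 case).

def dice(A: set, B: set) -> float:
    return 2 * len(A & B) / (len(A) + len(B))

def harmonic_mean_pr(A: set, B: set) -> float:
    p = len(A & B) / len(A)   # precision (A = retrieved)
    r = len(A & B) / len(B)   # recall    (B = relevant)
    return 2 * p * r / (p + r)

retrieved = {"B", "D", "F", "W", "Y"}
relevant  = {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"}
print(dice(retrieved, relevant))              # 0.4
print(harmonic_mean_pr(retrieved, relevant))  # 0.4 -> same value
```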
• Overlap Coefficient PP30-31 (Venn diagram over the documents set: query set A and document set B)
• Jaccard Coefficient (cont'd) PP31 (Venn diagram over the documents set: query set A and document set B)
• Example- Jaccard Coefficient
• D1 = 2T1 + 3T2 + 5T3, i.e. the weight vector (2, 3, 5)
• D2 = 3T1 + 7T2 + T3, i.e. (3, 7, 1)
• Q = 0T1 + 0T2 + 2T3, i.e. (0, 0, 2)
• J(D1, Q) = 10 / (38 + 4 - 10) = 10/32 = 0.31
• J(D2, Q) = 2 / (59 + 4 - 2) = 2/61 ≈ 0.03
PP31
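A sketch reproducing these numbers, assuming the extended (vector/Tanimoto) form of the Jaccard coefficient, J(d, q) = d·q / (|d|² + |q|² - d·q), which matches the arithmetic shown above:

```python
# Extended (vector) Jaccard coefficient for the slide's example.

def jaccard(d, q):
    dq = sum(di * qi for di, qi in zip(d, q))   # d·q
    dd = sum(di * di for di in d)               # |d|^2
    qq = sum(qi * qi for qi in q)               # |q|^2
    return dq / (dd + qq - dq)

D1 = (2, 3, 5)
D2 = (3, 7, 1)
Q  = (0, 0, 2)
print(round(jaccard(D1, Q), 2))  # 0.31 = 10 / 32
print(round(jaccard(D2, Q), 2))  # 0.03 ≈  2 / 61
```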
• Cosine Coefficient (cont'd) PP31-32 (diagram: document vectors (d_11, d_12, …, d_1n) and (d_21, d_22, …, d_2n) and query vector (q_1, q_2, …, q_n); similarity is measured by the angle between document and query vectors)
• Example-Cosine Coefficient
• Q = 0T1 + 0T2 + 2T3
• D1 = 2T1 + 3T2 + 5T3
• D2 = 3T1 + 7T2 + T3
C(D1, Q) = 10 / √(38 × 4) ≈ 0.81; C(D2, Q) = 2 / √(59 × 4) ≈ 0.13. PP31-32. (Vectors: D1 = (2, 3, 5), D2 = (3, 7, 1), Q = (0, 0, 2).)
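A sketch reproducing the cosine values above, using C(d, q) = d·q / √(|d|² · |q|²):

```python
# Cosine coefficient for the slide's example vectors.
from math import sqrt

def cosine(d, q):
    dq = sum(di * qi for di, qi in zip(d, q))
    dd = sum(di * di for di in d)
    qq = sum(qi * qi for qi in q)
    return dq / sqrt(dd * qq)

D1 = (2, 3, 5)
D2 = (3, 7, 1)
Q  = (0, 0, 2)
print(round(cosine(D1, Q), 2))  # 0.81 = 10 / sqrt(38 * 4)
print(round(cosine(D2, Q), 2))  # 0.13 =  2 / sqrt(59 * 4)
```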
• Asymmetric PP31 (diagram: documents d_i and d_j compared via their term weights, w_ki → w_kj)
• Euclidean distance PP32
• Manhattan block distance PP32
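The formulas for these two distance (dissimilarity) measures appear on the slides; below is a sketch applying the usual Euclidean and Manhattan (city-block) distances to the earlier example vectors:

```python
# Euclidean and Manhattan distances between a document and a query vector.
from math import sqrt

def euclidean(d, q):
    return sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

def manhattan(d, q):
    return sum(abs(di - qi) for di, qi in zip(d, q))

D1 = (2, 3, 5)
Q  = (0, 0, 2)
print(euclidean(D1, Q))  # sqrt(2^2 + 3^2 + 3^2) = sqrt(22) ≈ 4.69
print(manhattan(D1, Q))  # 2 + 3 + 3 = 8
```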
• Other Measures
• We may use prior/context knowledge.
• For example:
• Sim(q, d_j) = [content identifier similarity] + [objective term similarity] + [citation similarity] (a sketch follows below)
PP32
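A sketch of such a combined measure; the three component similarity functions and their returned values are hypothetical placeholders used only to show the summation:

```python
# Combine several component similarities into one score, as in the
# Sim(q, d_j) expression above. Components here are placeholders.

def combined_similarity(q, d, components, weights=None):
    """Sum (optionally weighted) component similarity scores."""
    if weights is None:
        weights = [1.0] * len(components)
    return sum(w * f(q, d) for w, f in zip(weights, components))

# Hypothetical component measures returning fixed scores:
content_sim   = lambda q, d: 0.5    # content identifier similarity
objective_sim = lambda q, d: 0.25   # objective term similarity
citation_sim  = lambda q, d: 0.125  # citation similarity

print(combined_similarity("q", "d", [content_sim, objective_sim, citation_sim]))
# 0.875
```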
• Comparison PP34
• Comparison of simple matching, Dice's coefficient, cosine coefficient, overlap coefficient, and Jaccard's coefficient. Simple matching counts |A ∩ B|; the other measures divide it by different denominators. Since |A| ≥ |A ∩ B| and |B| ≥ |A ∩ B|, the denominators satisfy |A| + |B| - |A ∩ B| ≥ (|A| + |B|)/2 ≥ √(|A| × |B|) ≥ min(|A|, |B|), so the coefficients are ordered O ≥ C ≥ D ≥ J. PP34
• Example: Documents, Terms, and Query (cont'd)
• D1: A search engine for 3D models
• D2: Design and implementation of a string database query language
• D3: Ranking of documents by measures considering conceptual dependence between terms
• D4: Exploiting hierarchical domain structure to compute similarity
• D5: An approach for measuring semantic similarity between words using multiple information sources
• D6: Determining semantic similarity among entity classes from different ontologies
• D7: Strong similarity measures for ordered sets of documents in information retrieval
T1: search(ing); T2: engine(s); T3: models; T4: database; T5: query; T6: language; T7: documents; T8: measur(es, ing); T9: conceptual; T10: dependence; T11: domain; T12: structure; T13: similarity; T14: semantic; T15: ontologies; T16: information; T17: retrieval. Query: "Semantic similarity measures used by search engines and other information searching mechanisms". PP33
• Example: Term-Document Matrix (cont'd), Matrix[q][A], PP34

|    | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15 | T16 | T17 |
|----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|
| D1 | 1  | 1  | 1  |    |    |    |    |    |    |     |     |     |     |     |     |     |     |
| D2 |    |    |    | 1  | 1  | 1  |    |    |    |     |     |     |     |     |     |     |     |
| D3 |    |    |    |    |    |    | 1  | 1  | 1  | 1   |     |     |     |     |     |     |     |
| D4 |    |    |    |    |    |    |    |    |    |     | 1   | 1   | 1   |     |     |     |     |
| D5 |    |    |    |    |    |    |    | 1  |    |     |     |     | 1   | 1   |     | 1   |     |
| D6 |    |    |    |    |    |    |    |    |    |     |     |     | 1   | 1   | 1   |     |     |
| D7 |    |    |    |    |    |    | 1  | 1  |    |     |     |     | 1   |     |     | 1   | 1   |
| Q  | 2  | 1  |    |    |    |    |    | 1  |    |     |     |     | 1   | 1   |     | 1   |     |
• Dice coefficient PP30, PP34
• Final Results PP34: O ≥ C ≥ D ≥ J, i.e. Overlap ≥ Cosine ≥ Dice ≥ Jaccard (a sketch computing all four measures follows).
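A sketch that scores the query against D1-D7 from the matrix above with the four set-based measures and checks the O ≥ C ≥ D ≥ J ordering; treating each document and the query as a set of terms (so the query's repeated term T1 counts once) is an assumption made here for simplicity:

```python
# Score the query against each document with the four set-based measures
# and verify Overlap >= Cosine >= Dice >= Jaccard for every document.
from math import sqrt

docs = {
    "D1": {"T1", "T2", "T3"},
    "D2": {"T4", "T5", "T6"},
    "D3": {"T7", "T8", "T9", "T10"},
    "D4": {"T11", "T12", "T13"},
    "D5": {"T8", "T13", "T14", "T16"},
    "D6": {"T13", "T14", "T15"},
    "D7": {"T7", "T8", "T13", "T16", "T17"},
}
query = {"T1", "T2", "T8", "T13", "T14", "T16"}

def scores(A, B):
    inter = len(A & B)
    return {
        "overlap": inter / min(len(A), len(B)),
        "cosine":  inter / sqrt(len(A) * len(B)),
        "dice":    2 * inter / (len(A) + len(B)),
        "jaccard": inter / len(A | B),
    }

for name, terms in docs.items():
    s = scores(query, terms)
    assert s["overlap"] >= s["cosine"] >= s["dice"] >= s["jaccard"]
    print(name, {k: round(v, 2) for k, v in s.items()})
```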
• Current Applications
• Multi-Dimensional Modeling
• Hierarchical Clustering
• Bioinformatics
PP35-38
• Discussion