
Similarity Measures



A similarity measure can represent the similarity between two documents, two queries, or one document and one query



  1. Chapter 3: Similarity Measures (Data Mining Technology)
  2. Chapter 3: Similarity Measures. Written by Kevin E. Heinrich. Presented by Zhao Xinyou, [email_address], 2007.6.7. Some materials (examples) are taken from websites.
  3. Searching Process: Input Text → Index → Query → Sorting → Show Result. (Figure: a user inputs keywords, the engine searches, and a ranked list of results is returned.)
  4. Example
  5. Similarity Measures
     - A similarity measure can represent the similarity between two documents, two queries, or one document and one query.
     - It makes it possible to rank the retrieved documents in order of presumed importance.
     - A similarity measure is a function that computes the degree of similarity between a pair of text objects.
     - A large number of similarity measures have been proposed in the literature, because the best similarity measure doesn't exist (yet!).
     PP27-28
  6. Classic Similarity Measures
     - All similarity measures should map to the range [-1, 1] or [0, 1].
     - 0 or -1 indicates minimum (incompatible) similarity.
     - 1 indicates maximum (absolute) similarity.
     PP28
  7. Conversion
     - Example: a raw score on a [1, 10] scale, where 1 indicates incompatible similarity and 10 absolute similarity, can be mapped to [0, 1] by s' = (s - 1) / 9.
     - In general, the linear rescaling is s' = (s - min_s) / (max_s - min_s); non-linear mappings are also possible.
     PP28
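The linear rescaling above can be sketched in a few lines (a minimal illustration; the function name is ours, not from the chapter):

```python
def normalize(s, min_s, max_s):
    """Linearly rescale a raw similarity score from [min_s, max_s] to [0, 1]."""
    return (s - min_s) / (max_s - min_s)

# On the slide's [1, 10] scale: 1 maps to 0, 10 maps to 1, and the
# midpoint 5.5 maps to 0.5.
print(normalize(5.5, 1, 10))  # → 0.5
```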
  8. Vector-Space Model (VSM)
     - In the 1960s, Salton and colleagues proposed the VSM, which was successfully applied in SMART (a text retrieval system).
     PP28-29
  9. Example
     - D is a set of m Web documents: D = {d1, d2, ..., di, ..., dm}, i = 1, 2, ..., m.
     - There are n distinct terms among the m documents, so each document is a term-weight vector: di = {wi1, wi2, ..., wij, ..., win}, i = 1, ..., m; j = 1, ..., n.
     - A query is likewise a weight vector: Q = {q1, q2, ..., qn}.
     - If similarity(q, di) > similarity(q, dj), we may conclude that di is more relevant than dj.
     PP28-29
  10. Simple Measure Technology
     - Within the document set, let A = retrieved documents, B = relevant documents, and A∩B = documents both retrieved and relevant.
     - Precision = returned relevant documents / total returned documents: P(A, B) = |A∩B| / |A|.
     - Recall = returned relevant documents / total relevant documents: R(A, B) = |A∩B| / |B|.
     PP29
  11. Example: Simple Measure Technology
     - Relevant: B = {A, B, C, D, E, F, G, H, I, J}, so |B| = 10.
     - Retrieved: A = {B, D, F, W, Y}, so |A| = 5.
     - Retrieved and relevant: A∩B = {B, D, F}, so |A∩B| = 3.
     - P = precision = 3/5 = 60%; R = recall = 3/10 = 30%.
     PP29
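The slide's set-based precision and recall can be reproduced directly with set operations (a sketch; the function names are ours):

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant: |A ∩ B| / |A|."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved: |A ∩ B| / |B|."""
    return len(retrieved & relevant) / len(relevant)

relevant = set("ABCDEFGHIJ")           # |B| = 10
retrieved = {"B", "D", "F", "W", "Y"}  # |A| = 5; intersection is {B, D, F}
print(precision(retrieved, relevant))  # → 0.6
print(recall(retrieved, relevant))     # → 0.3
```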
  12. Precision-Recall Graph (Curves)
     - There is a tradeoff between precision and recall, so precision is measured at different levels of recall.
     - When comparing the curves of two queries, it can be difficult to determine which of the two hypothetical results is better.
     PP29-30
  13. Similarity Measures Based on the VSM
     - Dice coefficient
     - Overlap coefficient
     - Jaccard coefficient
     - Cosine coefficient
     - Asymmetric measures
     - Dissimilarity (distance) measures
     - Other measures
     PP30
  14. Dice Coefficient (cont'd)
     - Definition of the harmonic mean: for x1, x2, ..., xn, the harmonic mean is E = n / (1/x1 + 1/x2 + ... + 1/xn).
     - The harmonic mean E of precision P and recall R is therefore E = 2 / (1/P + 1/R) = 2PR / (P + R).
     PP30
  15. Dice Coefficient (cont'd)
     - Definition of the Dice coefficient (the slide gives it with a weighting parameter α):
     - α > 0.5: precision is more important.
     - α < 0.5: recall is more important.
     - Usually α = 0.5.
     PP30
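The slide's formula is an image that did not survive extraction; a common vector-space form of the Dice coefficient for the usual α = 0.5 case (a sketch under that assumption, not necessarily the chapter's exact notation) is:

```python
def dice(x, y):
    """Dice coefficient for term-weight vectors: 2·(x·y) / (|x|² + |y|²)."""
    dot = sum(a * b for a, b in zip(x, y))
    return 2 * dot / (sum(a * a for a in x) + sum(b * b for b in y))

# Using the example vectors from the Jaccard/Cosine slides:
# D1 = (2, 3, 5), Q = (0, 0, 2), so Dice = 2·10 / (38 + 4).
print(round(dice((2, 3, 5), (0, 0, 2)), 3))  # → 0.476
```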
  16. Overlap Coefficient
     - Set form: Overlap(A, B) = |A∩B| / min(|A|, |B|), where A is the set of query terms and B the set of document terms.
     PP30-31
  17. Jaccard Coefficient (cont'd)
     - Set form: J(A, B) = |A∩B| / |A∪B| = |A∩B| / (|A| + |B| - |A∩B|), where A is the set of query terms and B the set of document terms.
     PP31
  18. Example: Jaccard Coefficient
     - D1 = 2T1 + 3T2 + 5T3 → (2, 3, 5)
     - D2 = 3T1 + 7T2 + T3 → (3, 7, 1)
     - Q = 0T1 + 0T2 + 2T3 → (0, 0, 2)
     - J(D1, Q) = 10 / (38 + 4 - 10) = 10/32 ≈ 0.31
     - J(D2, Q) = 2 / (59 + 4 - 2) = 2/61 ≈ 0.03
     PP31
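The extended (Tanimoto) Jaccard coefficient used in this example can be sketched as follows; it reproduces the slide's numbers:

```python
def jaccard(x, y):
    """Extended (Tanimoto) Jaccard coefficient for weight vectors:
    (x·y) / (|x|² + |y|² - x·y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

d1, d2, q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(jaccard(d1, q), 2))  # → 0.31  (10 / 32)
print(round(jaccard(d2, q), 2))  # → 0.03  (2 / 61)
```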
  19. Cosine Coefficient (cont'd)
     (Figure: document vectors (d11, d12, ..., d1n) and (d21, d22, ..., d2n) together with a query vector (q1, q2, ..., qn) in term space; the cosine of the angle between two vectors measures their similarity.)
     PP31-32
  20. Example: Cosine Coefficient
     - Q = 0T1 + 0T2 + 2T3 → (0, 0, 2)
     - D1 = 2T1 + 3T2 + 5T3 → (2, 3, 5)
     - D2 = 3T1 + 7T2 + T3 → (3, 7, 1)
     - C(D1, Q) = 10 / √(38·4) ≈ 0.81
     - C(D2, Q) = 2 / √(59·4) ≈ 0.13
     PP31-32
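The cosine computation above can be sketched and checked against the slide's values:

```python
import math

def cosine(x, y):
    """Cosine coefficient: (x·y) / (|x|·|y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

d1, d2, q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(cosine(d1, q), 2))  # → 0.81  (10 / √152)
print(round(cosine(d2, q), 2))  # → 0.13  (2 / √236)
```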
  21. Asymmetric
     (Figure: an asymmetric measure between documents di and dj, comparing the term weights wki and wkj; unlike the measures above, sim(di, dj) ≠ sim(dj, di) in general.)
     PP31
  22. Euclidean Distance
     - A dissimilarity (distance) measure: dist(x, y) = √(Σk (xk - yk)²).
     PP32
  23. Manhattan (Block) Distance
     - dist(x, y) = Σk |xk - yk|.
     PP32
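Both distance measures can be sketched in a few lines, reusing the example vectors from the earlier slides:

```python
import math

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

d1, q = (2, 3, 5), (0, 0, 2)
print(round(euclidean(d1, q), 2))  # √(4 + 9 + 9) → 4.69
print(manhattan(d1, q))            # 2 + 3 + 3 → 8
```

Note that, unlike the coefficients above, larger values here mean *less* similar documents.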
  24. Other Measures
     - We may use prior/context knowledge.
     - For example: Sim(q, dj) = [content identifier similarity] + [objective term similarity] + [citation similarity].
     PP32
  25. Comparison PP34
  26. Comparison
     - Simple matching, Dice's, Cosine, Overlap, and Jaccard's coefficients share the numerator |A∩B| and differ in the denominator: Overlap uses min(|A|, |B|), Cosine uses √(|A|·|B|), Dice uses (|A| + |B|)/2, and Jaccard uses |A| + |B| - |A∩B|.
     - Since |A| ≥ |A∩B| and |B| ≥ |A∩B|, we have min(|A|, |B|) ≤ √(|A|·|B|) ≤ (|A| + |B|)/2 ≤ |A| + |B| - |A∩B|, and therefore O ≥ C ≥ D ≥ J.
     PP34
  27. Example: Documents, Terms, Query (cont'd)
     - D1: A search engine for 3D models
     - D2: Design and implementation of a string database query language
     - D3: Ranking of documents by measures considering conceptual dependence between terms
     - D4: Exploiting hierarchical domain structure to compute similarity
     - D5: An approach for measuring semantic similarity between words using multiple information sources
     - D6: Determining semantic similarity among entity classes from different ontologies
     - D7: Strong similarity measures for ordered sets of documents in information retrieval
     Terms: T1 search(ing); T2 engine(s); T3 models; T4 database; T5 query; T6 language; T7 documents; T8 measur(es, ing); T9 conceptual; T10 dependence; T11 domain; T12 structure; T13 similarity; T14 semantic; T15 ontologies; T16 information; T17 retrieval.
     Query: Semantic similarity measures used by search engines and other information searching mechanisms
     PP33
  28. Example: Term-Document Matrix (cont'd), Matrix[q][A]
           T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17
     D1     1  1  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
     D2     0  0  0  1  1  1  0  0  0  0   0   0   0   0   0   0   0
     D3     0  0  0  0  0  0  1  1  1  1   0   0   0   0   0   0   0
     D4     0  0  0  0  0  0  0  0  0  0   1   1   1   0   0   0   0
     D5     0  0  0  0  0  0  0  1  0  0   0   0   1   1   0   1   0
     D6     0  0  0  0  0  0  0  0  0  0   0   0   1   1   1   0   0
     D7     0  0  0  0  0  0  1  1  0  0   0   0   1   0   0   1   1
     Q      2  1  0  0  0  0  0  1  0  0   0   0   1   1   0   1   0
     PP34
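With the matrix in hand, the coefficients from the earlier slides can all be computed for the query against every document. The sketch below uses vector-space generalizations of the four coefficients (an assumption on our part; the chapter's exact formulas may differ in notation), and confirms the ordering O ≥ C ≥ D ≥ J stated on the comparison slide:

```python
import math

# Term-document matrix from the slide, as sparse dicts {term index: weight}.
docs = {
    "D1": {1: 1, 2: 1, 3: 1},
    "D2": {4: 1, 5: 1, 6: 1},
    "D3": {7: 1, 8: 1, 9: 1, 10: 1},
    "D4": {11: 1, 12: 1, 13: 1},
    "D5": {8: 1, 13: 1, 14: 1, 16: 1},
    "D6": {13: 1, 14: 1, 15: 1},
    "D7": {7: 1, 8: 1, 13: 1, 16: 1, 17: 1},
}
q = {1: 2, 2: 1, 8: 1, 13: 1, 14: 1, 16: 1}  # T1 appears twice in the query

def dot(x, y):
    return sum(w * y.get(t, 0) for t, w in x.items())

def norm2(x):
    return sum(w * w for w in x.values())

def coefficients(x, y):
    """Overlap, Cosine, Dice, and Jaccard for term-weight vectors."""
    s, nx, ny = dot(x, y), norm2(x), norm2(y)
    return {
        "overlap": s / min(nx, ny),
        "cosine": s / math.sqrt(nx * ny),
        "dice": 2 * s / (nx + ny),
        "jaccard": s / (nx + ny - s),
    }

for name, d in sorted(docs.items()):
    c = coefficients(q, d)
    # For every document, O >= C >= D >= J, as the comparison slide states.
    assert c["overlap"] >= c["cosine"] >= c["dice"] >= c["jaccard"]
    print(name, {k: round(v, 2) for k, v in c.items()})
```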
  29. Dice Coefficient PP30, PP34
  30. Final Results: O ≥ C ≥ D ≥ J. PP34
  31. Current Applications
     - Multi-dimensional modeling
     - Hierarchical clustering
     - Bioinformatics
     PP35-38
  32. Discussion
