Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

similarity measure

28,316 views

Published on

A similarity measure can represent the similarity between two documents, two queries, or one document and one query

Published in: Technology, Education
  • Dating for everyone is here: ❤❤❤ http://bit.ly/369VOVb ❤❤❤
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Dating direct: ❤❤❤ http://bit.ly/369VOVb ❤❤❤
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • DOWNLOAD THIS BOOKS INTO AVAILABLE FORMAT (2019 Update) ......................................................................................................................... ......................................................................................................................... Download Full PDF EBOOK here { https://soo.gd/irt2 } ......................................................................................................................... Download Full EPUB Ebook here { https://soo.gd/irt2 } ......................................................................................................................... Download Full doc Ebook here { https://soo.gd/irt2 } ......................................................................................................................... Download PDF EBOOK here { https://soo.gd/irt2 } ......................................................................................................................... Download EPUB Ebook here { https://soo.gd/irt2 } ......................................................................................................................... Download doc Ebook here { https://soo.gd/irt2 } ......................................................................................................................... ......................................................................................................................... ................................................................................................................................... eBook is an electronic version of a traditional print book THIS can be read by using a personal computer or by using an eBook reader. (An eBook reader can be a software application for use on a computer such as Microsoft's free Reader application, or a book-sized computer THIS is used solely as a reading device such as Nuvomedia's Rocket eBook.) Users can purchase an eBook on diskette or CD, but the most popular method of getting an eBook is to purchase a downloadable file of the eBook (or other reading material) from a Web site (such as Barnes and Noble) to be read from the user's computer or reading device. Generally, an eBook can be downloaded in five minutes or less ......................................................................................................................... .............. Browse by Genre Available eBooks .............................................................................................................................. Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, ......................................................................................................................... ......................................................................................................................... .....BEST SELLER FOR EBOOK RECOMMEND............................................................. ......................................................................................................................... Blowout: Corrupted Democracy, Rogue State Russia, and the Richest, Most Destructive Industry on Earth,-- The Ride of a Lifetime: Lessons Learned from 15 Years as CEO of the Walt Disney Company,-- Call Sign Chaos: Learning to Lead,-- StrengthsFinder 2.0,-- Stillness Is the Key,-- She Said: Breaking the Sexual Harassment Story THIS Helped Ignite a Movement,-- Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones,-- Everything Is Figureoutable,-- What It Takes: Lessons in the Pursuit of Excellence,-- Rich Dad Poor Dad: What the Rich Teach Their Kids About Money THIS the Poor and Middle Class Do Not!,-- The Total Money Makeover: Classic Edition: A Proven Plan for Financial Fitness,-- Shut Up and Listen!: Hard Business Truths THIS Will Help You Succeed, ......................................................................................................................... .........................................................................................................................
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Get Now to Read eBook === http://bestadaododadj.justdied.com/2897674342-echec-les-archer-d-avalon-t2.html
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Get HERE to Read This eBook === http://bestadaododadj.justdied.com/B014SRXWXW-love-mission-t15.html
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

similarity measure

  1. 1. Chapter 3 Similarity Measures Data Mining Technology
  2. 2. Chapter 3 Similarity Measures Written by Kevin E. Heinrich Presented by Zhao Xinyou [email_address] 2007.6.7 Some materials (Examples) are taken from Website.
  3. 3. Searching Process Input Text Process Index Query Sorting Show Text Result Input Text Index Query Sorting Show Text Result Input Key Words Results Search <ul><li>XXXXXXX </li></ul><ul><li>YYYYYYY </li></ul><ul><li>ZZZZZZZZ </li></ul><ul><li>....................... </li></ul>
  4. 4. Example
  5. 5. Similarity Measures <ul><li>A similarity measure can represent the similarity between two documents , two queries , or one document and one query </li></ul><ul><li>It is possible to rank the retrieved documents in the order of presumed importance </li></ul><ul><li>A similarity measure is a function which computes the degree of similarity between a pair of text objects </li></ul><ul><li>There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!) </li></ul>PP27-28
  6. 6. Classic Similarity Measures <ul><li>All similarity measures should map to the range [-1,1] or [0,1], </li></ul><ul><li>0 or -1 shows minimum similarity. (incompatible similarity) </li></ul><ul><li>1 shows maximum similarity. (absolute similarity) </li></ul>PP28
  7. 7. Conversion <ul><li>For example </li></ul><ul><li>1 shows incompatible similarity, 10 shows absolute similarity. </li></ul>[1, 10] [0, 1] s’ = (s – 1 ) / 9 Generally, we may use : s’ = ( s – min_s ) / ( max_s – min_s ) Linear Non-linear PP28
  8. 8. Vector-Space Model-VSM <ul><li>1960s Salton etc provided VSM, which has been successfully applied on SMART (a text searching system). </li></ul>PP28-29
  9. 9. Example <ul><li>D is a set, which contains m Web documents; D={d 1 , d 2 ,…d i …d m } i=1,2…m </li></ul><ul><li>There are n words among m Web documents. di={w i1 ,w i2 ,…w ij ,…w in } i=1,2…m , j=1,2,….n </li></ul><ul><li>Q= {q 1 ,q 2 ,…q i ,….q n } i=1,2,….n </li></ul>PP28-29 If similarity(q,d i ) > similarity(q, d j ) We may get the result d i is more relevant than d j
  10. 10. Simple Measure Technology Documents Set PP29 Retrieved A Relevant B Retrieved and Relevant A ∩B Precision = Returned Relevant Documents / Total Returned Documents Recall = Returned Relevant Documents / Total Relevant Documents P(A,B) = |A ∩B| / |A| R(A,B) = |A ∩B| / |B|
  11. 11. Example--Simple Measure Technology PP29 Documents Set A,C,E,G, H, I, J Relevant B,D,F Retrieved & Relevant W,Y Retrieved |B| = {relevant} ={A,B,C,D,E,F,G,H,I,J} = 10 |A| = {retrieved} = {B, D, F,W,Y} = 5 |A∩B| = {relevant} ∩ {retrieved} ={B,D,F} = 3 P = precision = 3/5 = 60% R = recall = 3/10 = 30%
  12. 12. Precision-Recall Graph-Curves <ul><li>There is a tradeoff between Precision and Recall </li></ul><ul><li>So measure Precision at different levels of Recall </li></ul>PP29-30 One Query Two Queries Difficult to determine which of these two hypothetical results is better
  13. 13. Similarity measures based on VSM <ul><li>Dice coefficient </li></ul><ul><li>Overlap Coefficient </li></ul><ul><li>Jaccard </li></ul><ul><li>Cosine </li></ul><ul><li>Asymmetric </li></ul><ul><li>Dissimilarity </li></ul><ul><li>Other measures </li></ul>PP30
  14. 14. Dice Coefficient-Cont’ <ul><li>Definition of Harmonic Mean: </li></ul><ul><li>To X 1 ,X 2 , …, X n ,their harmonic mean E equals n divided by(1/x 1 +1/x 2 +…+1/X n ), that is </li></ul>PP30 <ul><li>To Harmonic Mean ( E ) of Precision ( P ) and Recall ( R ) </li></ul>
  15. 15. Dice Coefficient-Cont’ <ul><li>Denotation of Dice Coefficient: </li></ul>PP30 α>0.5 : precision is more important α<0.5 : recall is more important Usually α=0.5
  16. 16. Overlap Coefficient PP30-31 Documents Set A queries B Documents
  17. 17. Jaccard Coefficient-Cont’ PP31 Documents Set A queries B Documents
  18. 18. Example- Jaccard Coefficient <ul><li>D1 = 2T1 + 3T2 + 5T3 , (2,3,5) </li></ul><ul><li>D2 = 3T1 + 7T2 + T3 , (3,7,1) </li></ul><ul><li>Q = 0T1 + 0T2 + 2T3 , (0,0,2) </li></ul><ul><li>J ( D1 , Q ) = 10 / (38+4-10) = 10/32 = 0.31 </li></ul><ul><li>J ( D2 , Q ) = 2 / (59+4-2) = 2/61 = 0.04 </li></ul>PP31
  19. 19. Cosine Coefficient-Cont’ PP31-32 (d 21 ,d 22 ,…d 2n ) (d 11 ,d 12 ,…d 1n ) (q 1 ,q 2 ,…q n )
  20. 20. Example-Cosine Coefficient <ul><li>Q = 0T1 + 0T2 + 2T3 </li></ul><ul><li>D1 = 2T1 + 3T2 + 5T3 </li></ul><ul><li>D2 = 3T1 + 7T2 + T3 </li></ul>C ( D1 , Q ) = = 10/ √ (38*4) = 0.81 C ( D2 , Q ) = 2 / √ (59*4) = 0.13 PP31-32 (3,7,1) (2,3,5) (0,0,2)
  21. 21. Asymmetric PP31 dj di W ki -->w kj
  22. 22. Euclidean distance PP32
  23. 23. Manhattan block distance PP32
  24. 24. Other Measures <ul><li>We may use priori/context knowledge </li></ul><ul><li>For example: </li></ul><ul><li>Sim(q,d j )= [content identifier similarity]+ </li></ul><ul><li>[objective term similarity]+ </li></ul><ul><li>[citation similarity] </li></ul>PP32
  25. 25. Comparison PP34
  26. 26. Comparison Simple matching Dice’s Coefficient Cosine Coefficient Overlap Coefficient Jaccard’s Coefficient |A|+|B|-|A ∩ B| ≥(|A|+|B|)/2 |A| ≥ |A ∩ B| |B| ≥ |A∩B| (|A|+|B|)/2 ≥ √ (|A|*|B|) √ (|A|*|B|) ≥ min (|A|, |B|) O≥C≥D≥J PP34
  27. 27. Example- Documents-Term-Query -Cont’ <ul><li>D1:A search Engine for 3D Models </li></ul><ul><li>D2:Design and Implementation of a string database query language </li></ul><ul><li>D3:Ranking of documents by measures considering conceptual dependence between terms </li></ul><ul><li>D4 Exploiting hierarchical domain structure to compute similarity </li></ul><ul><li>D5:an approach for measuring semantic similarity between words using multiple information sources </li></ul><ul><li>D6:determinging semantic similarity among entity classes from different ontologies </li></ul><ul><li>D7:strong similarity measures for ordered sets of documents in information retrieval </li></ul>T1: search(ing) T2: Engine(s) T3: Models T4: database T5: query T6: language T7: documents T8: measur(es,ing) T9: conceptual T10: dependence T11: domain T12: structure T13: similarity T14: semantic T15: ontologies T16: information T17: retrieval Query: Semantic similarity measures used by search engines and other information searching mechanisms PP33
  28. 28. Example- Term-Document Matrix -Cont’ Matrix[q][A] PP34 1 1 1 1 1 2 Q 1 1 1 1 1 D7 1 1 1 D6 1 1 1 1 D5 1 1 1 D4 1 1 1 1 D3 1 1 1 D2 1 1 1 D1 T17 T16 T15 T14 13 T12 T11 T10 T9 T8 T7 T6 T5 T4 T3 T2 T1
  29. 29. Dice coefficient PP30, PP34
  30. 30. Final Results PP34 O≥C≥D≥J
  31. 31. Current Applications <ul><li>Multi-Dimensional Modeling </li></ul><ul><li>Hierarchical Clustering </li></ul><ul><li>Bioinformatics </li></ul>PP35-38
  32. 32. <ul><li>Discussion </li></ul>

×