Similarity Measurement Preliminary Results

    1. Similarity Measurement: Folksonomy vs. LSA
       Preliminary Results
    2. The Tripartite Structure of Tagging
       A folksonomy is a set of triples <user, tag, resource>.
       Formally, a folksonomy is a tuple F := (U, T, R, Y), where U, T, and R are finite sets whose elements are called users, tags, and resources, and Y ⊆ U × T × R is a ternary relation between users, tags, and resources.
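The tripartite model can be sketched directly as a set of (user, tag, resource) triples; the projections onto each component recover U, T, and R. A toy illustration (not the authors' code), with made-up users, tags, and resources:

```python
# Toy folksonomy: Y is a ternary relation between users, tags, and resources.
Y = {
    ("alice", "nlp", "page1"),
    ("alice", "linguistics", "page1"),
    ("bob", "nlp", "page1"),
    ("bob", "python", "page2"),
}

U = {u for u, _, _ in Y}  # users appearing in the relation
T = {t for _, t, _ in Y}  # tags appearing in the relation
R = {r for _, _, r in Y}  # resources appearing in the relation

print(sorted(T))  # ['linguistics', 'nlp', 'python']
```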
    3. Del.icio.us Tag Distribution
       Tag distribution; log-log tag distribution.
       After crawling the delicious.com site, the total number of tags (tokens) obtained was 7,528,528, among which the number of types was 188,964. All the tags were stemmed using the Porter stemmer, after which the number of distinct stemmed tags was 174,887.
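The token/type distinction used here, and why stemming shrinks the type count, can be shown on a toy tag stream (a sketch; the dictionary stemmer below merely stands in for the Porter stemmer named on the slide):

```python
# Toy tag stream; each element is one tag occurrence (a token).
tags = ["nlp", "nlp", "blogs", "blogging", "python", "nlp"]

tokens = len(tags)       # every occurrence counts: 6 tokens
types_ = len(set(tags))  # distinct tags: 4 types

# A stand-in mapping for the Porter stemmer used on the slide:
# it merges the inflected forms 'blogs' and 'blogging' into 'blog'.
stem = {"blogs": "blog", "blogging": "blog"}
stemmed = [stem.get(t, t) for t in tags]
stemmed_types = len(set(stemmed))  # 3 types after stemming

print(tokens, types_, stemmed_types)  # 6 4 3
```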
    4. LSA Processing Workflow in R
       tm = textmatrix("dir/")
       tm = lw_logtf(tm) * gw_idf(tm)
       space = lsa(tm, dims=dimcalc_share())
       as.textmatrix(space)
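The R workflow above does four things: build a term-document matrix, weight it with log-tf × idf, take a truncated SVD, and reconstruct the reduced-rank matrix. A minimal NumPy sketch of the same steps (an illustration with toy counts, not the authors' pipeline; the fixed k=2 stands in for dimcalc_share()):

```python
import numpy as np

# Tiny term-document count matrix: rows = terms, cols = documents.
tm = np.array([[2, 0, 1],
               [0, 3, 1],
               [1, 1, 0]], dtype=float)

# Local weighting log(1 + tf) and global weighting idf = log(n_docs / df).
n_docs = tm.shape[1]
df = (tm > 0).sum(axis=1)                       # document frequency per term
weighted = np.log1p(tm) * np.log(n_docs / df)[:, None]

# Truncated SVD: keep k latent dimensions, then reconstruct.
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k text matrix

print(approx.shape)  # (3, 3)
```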
    5. LSA Corpus Preparation
       A total of 17,085 web pages were crawled and then parsed to remove all HTML markup.
       Stemming and stop-word removal were applied.
       The processed corpus: 14,993,620 tokens, 259,464 word types.
       Only words with frequency greater than 100 were kept for the word-by-document matrix; there were 1,047 words with frequency greater than 100. Document length in the corpus ranges from 51 to 4,009 words per document.
       The resulting term-document matrix has 3,465 columns (documents) and 1,082 rows (words).
    6. LSA Document Length Distribution
       Document length in the corpus ranges from 51 to 4,009 words per document.
    7. Three Similarity Measurements
       Tag co-occurrence counts
       Tag vector cosine similarity
       LSA
    8. Similarity Measurement
       Tag co-occurrence counts:
       1) Simple count: how many times two tags are used by the same user to annotate the same resource.
       2) Normalized count (Jaccard index): the co-occurrence count of tags A and B divided by the number of annotations carrying A or B, i.e. freq(A) + freq(B) minus the co-occurrence count.
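Both variants can be computed from per-post tag sets, where a post is one (user, resource) pair. A toy sketch (invented annotations, not the Del.icio.us data):

```python
from itertools import combinations
from collections import Counter

# Toy annotations: (user, resource) -> set of tags applied in that post.
posts = {
    ("alice", "page1"): {"nlp", "linguistics"},
    ("bob",   "page1"): {"nlp", "linguistics", "python"},
    ("bob",   "page2"): {"python", "nlp"},
}

freq = Counter()   # in how many posts each tag appears
cooc = Counter()   # in how many posts each tag pair co-occurs
for tags in posts.values():
    freq.update(tags)
    cooc.update(frozenset(p) for p in combinations(sorted(tags), 2))

pair = frozenset({"nlp", "linguistics"})
simple = cooc[pair]                      # simple count: 2 posts share both
# Jaccard index: co-occurrences / (posts with A or B)
jaccard = simple / (freq["nlp"] + freq["linguistics"] - simple)
print(simple, round(jaccard, 3))  # 2 0.667
```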
    9. Distribution of Tag Co-occurrence Counts (simple counts)
    10. Distribution of Tag Co-occurrence Counts (normalized)
    11. Measurement 2: Cosine Similarity
       Based on the co-occurrence vector of each tag with every other tag.
       Since there are both normalized and unnormalized tag co-occurrence counts, we end up with two cosine similarity measures.
       cos(X, Y) = X · Y / (|X| |Y|), where X and Y are the co-occurrence vectors of two distinct tags.
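The cosine of two co-occurrence vectors can be sketched in a few lines of plain Python (toy vectors, for illustration only):

```python
import math

# Co-occurrence vectors of two tags with every other tag (toy numbers).
x = [3, 0, 2, 5]
y = [1, 4, 2, 3]

dot = sum(a * b for a, b in zip(x, y))          # X · Y
norm_x = math.sqrt(sum(a * a for a in x))       # |X|
norm_y = math.sqrt(sum(b * b for b in y))       # |Y|
cos = dot / (norm_x * norm_y)

print(round(cos, 3))  # 0.652
```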
    12. Distribution of Tag Cosine Similarity
    13. Distribution of Tag Cosine Similarity Based on Normalized Tag Co-occurrence Counts
    14. Results
       The pairwise Pearson and Spearman correlations among five measurements: tag co-occurrence count, tag cosine similarity, LSA, normalized tag co-occurrence count, and tag cosine similarity based on the normalized tag-tag co-occurrence matrix.
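Both correlations compare two lists of similarity scores over the same tag pairs; Spearman's rho is simply Pearson's r computed on the ranks. A self-contained sketch (toy scores, no tie handling):

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # Assign ranks 1..n by sorted order (no tie handling; illustration only).
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the rank vectors.
    return pearson(ranks(x), ranks(y))

# Toy similarity scores from two measurements over the same five tag pairs.
m1 = [0.9, 0.1, 0.4, 0.7, 0.2]
m2 = [0.8, 0.2, 0.5, 0.9, 0.1]
print(round(pearson(m1, m2), 3), round(spearman(m1, m2), 3))  # 0.926 0.8
```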
    15. Correlation
    16.–20. [Correlation plots only; no slide text.]
    21. Qualitative Insight: ‘linguistics’
       Top 10 “linguistics”-related words according to the 5 measurements.
    22. Correlation Between Two Measurements: Normalized Tag Co-occurrence Counts vs. Normalized Tag Cosine
       P = 0.7547
