An Empirical Study of Vocabulary Relatedness and Its Application to Recommender Systems

  1. An Empirical Study of Vocabulary Relatedness and Its Application to Recommender Systems. Gong Cheng, Saisai Gong, Yuzhong Qu. State Key Laboratory for Novel Software Technology, Nanjing University, China. gcheng@nju.edu.cn. Presented at ISWC2011. Slide header: ws.nju.edu.cn.
  2. Measuring term similarity. Figure (vocabulary matching): term pairs FacultyMember / Faculty, FullProfessor / Professor, AssistantProfessor / AssistantProfessor; similarity values shown: 0.9, 0.8, 1.0. Slide footer: Gong Cheng (程龚), gcheng@nju.edu.cn.
  3. Measuring vocabulary similarity. Figure (vocabulary matching, vocabulary distance): Semantic Web for Research Communities (SWRC), Foundational Model of Anatomy (FMA), GALEN, eBiquity Person, NCBI organismal classification (NCBITaxon); values shown: 0.8, 0.5, 0.6, 0.02, 0.5.
  4. Measuring vocabulary relatedness. Figure (vocabulary matching, vocabulary distance, vocabulary relatedness): terms FacultyMember, FullProfessor, AssistantProfessor and Postgraduate-Research-Degree, PhD, EngD. Not that similar, but somewhat related.
  5. Contributions. How to measure vocabulary relatedness? 6 measures, from 4 aspects. How about vocabulary relatedness in real-life cases? Empirical analysis of 2,996 vocabularies and 4 billion other RDF triples. Where to apply vocabulary relatedness? Post-selection vocabulary recommendation in vocabulary search.
  6. Outline. Data set; Vocabulary relatedness; Post-selection vocabulary recommendation; Conclusions.
  7. Data set statistics. Crawled from February 2010 to May 2011 by
  8. Data set distributions. RDF documents over pay-level domains.
  9. Data set distributions. Vocabularies over top-level domains.
  10. Outline. Data set; Vocabulary relatedness; Post-selection vocabulary recommendation; Conclusions.
  11. Vocabulary relatedness. 6 numerical measures, from 4 aspects: semantic relatedness (explicit, implicit, hybrid), content similarity, expressivity closeness, distributional relatedness. Comparison.
  12. Measure 1: explicit semantic relatedness. $R_S^{E}(v_i, v_j) = \dfrac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_E}$. Figure: $G_E$ over vocabularies $v_1$, $v_2$, $v_3$, with owl:imports, owl:priorVersion, and rdfs:seeAlso relations; edge weights shown: 1, 2.
  13. Measure 2: implicit semantic relatedness. $R_S^{I}(v_i, v_j) = \dfrac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_I}$. Figure: $G_I$ over vocabularies $v_2$, $v_3$, $v_4$, with owl:inverseOf and rdfs:subClassOf relations among terms $t_2$, $t_3$, $t_4$; edge weights shown: 1, 2.
  14. Measure 3: hybrid semantic relatedness. $R_S^{E+I}(v_i, v_j) = \dfrac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_{E+I}}$. Figure: $G_{E+I}$ over vocabularies $v_1$, $v_2$, $v_3$, $v_4$; edge weights shown: 1, 1, 2.
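
The three semantic relatedness measures on slides 12 to 14 share one form: the reciprocal of the weight of a shortest path in a vocabulary graph. The following is a minimal Python sketch of that idea, not the authors' implementation. The edge lists, the undirected treatment of edges, the choice to keep the lightest edge when merging the two graphs, and the values returned for unreachable or identical vocabularies are all assumptions made for illustration.

```python
import heapq

def undirected(edges):
    """Build an adjacency map from (a, b, weight) triples.
    Assumption: edges are undirected and the lightest parallel edge wins."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if w < graph[a].get(b, float("inf")):
            graph[a][b] = w
            graph[b][a] = w
    return graph

def shortest_path_weight(graph, source, target):
    """Dijkstra's algorithm; returns None when target is unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return None

def relatedness(graph, vi, vj):
    """1 / (weight of a shortest path between vi and vj)."""
    if vi == vj:
        return 1.0  # assumption: the slides do not define self-relatedness
    weight = shortest_path_weight(graph, vi, vj)
    return 0.0 if weight is None else 1.0 / weight  # assumption: 0 if disconnected

# Hypothetical edges, loosely modeled on the slide figures.
edges_e = [("v1", "v2", 1.0), ("v2", "v3", 2.0)]  # explicit relations
edges_i = [("v3", "v4", 1.0)]                     # implicit relations
g_e = undirected(edges_e)
g_ei = undirected(edges_e + edges_i)              # hybrid graph

print(relatedness(g_e, "v1", "v3"))   # path weight 3 in the explicit graph
print(relatedness(g_ei, "v1", "v4"))  # path weight 4 in the hybrid graph
```

In this toy data, v4 is unreachable in the explicit graph alone, which is the kind of gap the hybrid graph is meant to close.
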
  15. Empirical analysis (1). Statistical properties of $G_E$, $G_I$ and $G_{E+I}$.
  16. Empirical analysis (2). Explicit relations between vocabularies.
  17. Measure 4: content similarity. Harmonic mean; maximum similarity between their labels.
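
Slide 17 names only two ingredients: a harmonic mean, and the maximum similarity between the labels of two terms. The transcript does not preserve the formula, so the Python sketch below is one plausible reading rather than the authors' definition. It assumes that term similarity is the best label-to-label match, that each direction is the average best match over a vocabulary's terms, and that the two directions are combined by a harmonic mean. The use of `difflib.SequenceMatcher` as the string similarity and all example labels are also assumptions.

```python
from difflib import SequenceMatcher

def label_similarity(a, b):
    """String similarity in [0, 1]; SequenceMatcher is a stand-in choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def term_similarity(labels_a, labels_b):
    """Maximum similarity between the labels of two terms."""
    return max(
        (label_similarity(a, b) for a in labels_a for b in labels_b),
        default=0.0,
    )

def directed_similarity(vocab_a, vocab_b):
    """Assumption: average, over terms of vocab_a, of the best match in vocab_b."""
    if not vocab_a or not vocab_b:
        return 0.0
    total = sum(
        max(term_similarity(la, lb) for lb in vocab_b.values())
        for la in vocab_a.values()
    )
    return total / len(vocab_a)

def content_similarity(vocab_a, vocab_b):
    """Harmonic mean of the two directed similarities."""
    ab = directed_similarity(vocab_a, vocab_b)
    ba = directed_similarity(vocab_b, vocab_a)
    return 0.0 if ab + ba == 0 else 2 * ab * ba / (ab + ba)

# Hypothetical vocabularies: term identifier -> list of labels.
vocab_1 = {"t1": ["Full Professor"], "t2": ["Assistant Professor"]}
vocab_2 = {"u1": ["Professor"], "u2": ["Assistant Professor", "Asst. Prof."]}

print(content_similarity(vocab_1, vocab_2))
```

The harmonic mean penalizes a one-sided match: a small vocabulary fully covered by a large one still scores low if most of the large vocabulary has no counterpart.
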
  18. Empirical analysis (3). 86 label-like properties: rdfs:label, dc:title, and their subproperties (e.g. skos:prefLabel), and local name. Two pie charts (terms and their labels; vocabulary distribution), each split into with (w/) and without (w/o); values shown: 36.33% / 63.67% and 36.21% / 63.79%.
  19. Measure 5: expressivity closeness. Figure: terms tp, tq, tr and their MetaTerms (owl:TransitiveProperty, rdfs:domain, owl:inverseOf, rdf:type). Jaccard.
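
Slide 19 points to the Jaccard coefficient over the sets of meta-level terms that two vocabularies use. A minimal Python sketch follows. The two example sets are hypothetical, and returning 0 when both sets are empty is an assumption, since the slide does not cover that case.

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # assumption: undefined case mapped to 0
    return len(a & b) / len(a | b)

# Hypothetical MetaTerms sets for two vocabularies.
meta_terms_vi = {"rdf:type", "rdfs:domain", "owl:TransitiveProperty"}
meta_terms_vj = {"rdf:type", "owl:inverseOf", "owl:TransitiveProperty"}

# 2 shared meta-level terms out of 4 distinct ones.
print(jaccard(meta_terms_vi, meta_terms_vj))
```
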
  20. Empirical analysis (4). 4,978 meta-level terms, 469 (9.42%) in >1 vocabulary. Most popular meta-level terms: 1. rdf:type, 2. rdfs:domain, 3. rdfs:range, 4. …; and after excluding language constructs. 10.13 meta-level terms per vocabulary. ≤20 meta-level terms in 92.96% vocabularies, but hundreds in Cyc.
  21. Measure 6: distributional relatedness. Distributional profile: $DP(v) = \big(p(v_1 \mid v),\ p(v_2 \mid v),\ \ldots,\ p(v_n \mid v)\big)$. $R_D(v_i, v_j) = \cos\big(DP(v_i), DP(v_j)\big)$.
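
Slide 21 defines distributional relatedness as the cosine between two distributional profiles, where each profile is a vector of conditional probabilities p(v_k | v). The Python sketch below illustrates that computation. How the probabilities are estimated is not preserved in the transcript, so the sketch assumes they are normalized co-instantiation counts; the counts and vocabulary names are hypothetical.

```python
import math

def distributional_profile(co_counts, v, vocabularies):
    """DP(v) = (p(v_1 | v), ..., p(v_n | v)).
    Assumption: p(v_k | v) is a co-instantiation count normalized over v's row."""
    row = co_counts.get(v, {})
    total = sum(row.get(u, 0) for u in vocabularies)
    if total == 0:
        return [0.0] * len(vocabularies)
    return [row.get(u, 0) / total for u in vocabularies]

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty profile is unrelated to everything
    return dot / (norm_x * norm_y)

def distributional_relatedness(co_counts, vi, vj, vocabularies):
    return cosine(
        distributional_profile(co_counts, vi, vocabularies),
        distributional_profile(co_counts, vj, vocabularies),
    )

# Hypothetical co-instantiation counts: co_counts[v][u] = times u occurs with v.
vocabularies = ["vA", "vB", "vC", "vD"]
co_counts = {
    "vA": {"vA": 4, "vB": 4},
    "vC": {"vA": 2, "vB": 2},
    "vD": {"vD": 5},
}

print(distributional_relatedness(co_counts, "vA", "vC", vocabularies))
print(distributional_relatedness(co_counts, "vA", "vD", vocabularies))
```

Here vA and vC never co-occur with each other's identifier in the same proportions by construction, yet their profiles point in the same direction, which is what lets the measure relate vocabularies that are used in similar company.
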
  22. Empirical analysis (5). Instantiation found for 1,874 (62.55%) vocabularies. Most popular vocabularies (excluding languages).
  23. Empirical analysis (6). Co-instantiation found for 9,763 pairs of vocabularies. Most popular vocabulary co-instantiation (excluding languages).
  24. Vocabulary relatedness. 6 numerical measures, from 4 aspects: semantic relatedness (explicit, implicit, hybrid), content similarity, expressivity closeness, distributional relatedness. Comparison.
  25. Agreement between measures. Spearman’s rank correlation coefficient ($\rho \in [-1, 1]$). Single-link hierarchical clustering.
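
Slide 25 compares measures with Spearman's rank correlation coefficient. As a reminder of what that statistic computes, here is a small self-contained Python sketch: it ranks both score lists (ties get the average rank) and takes the Pearson correlation of the ranks. The score lists are hypothetical, and returning 0 for a constant list is an assumption, since the coefficient is undefined there.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: undefined for constant input, mapped to 0
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores that two measures assign to the same vocabulary pairs.
measure_a = [0.10, 0.40, 0.90, 0.40]
measure_b = [0.05, 0.30, 0.70, 0.20]

print(spearman(measure_a, measure_b))
```
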
  26. Outline. Data set; Vocabulary relatedness; Post-selection vocabulary recommendation; Conclusions.
  27. Relatedness-based ranking. Ranking by a single measure; ranking by multiple measures.
  28. Popularity-based re-ranking. Degree of influence of popularity. Number of pay-level domains instantiating $v_i$.
  29. Evaluation settings. 20 “selections” randomly selected from 1,302 moderate-sized vocabularies. Depth-10 pooling with 2 experts. Ratings: closely related: 2; somewhat related: 1; unrelated: 0. Metric: NDCG.
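
Slide 29 uses NDCG over graded ratings (2, 1, 0). NDCG has several common variants and the slide does not say which one was used, so the Python sketch below assumes one of them: gain equal to the rating and a log2(rank + 1) discount, normalized by the DCG of the ideal ordering of all judged ratings. The example ratings are hypothetical.

```python
import math

def dcg(ratings, k):
    """Discounted cumulative gain at cutoff k.
    Assumption: gain = rating, discount = log2(rank + 1) with 1-based ranks."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))

def ndcg(ranked_ratings, judged_ratings, k):
    """DCG of the produced ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(judged_ratings, reverse=True), k)
    return 0.0 if ideal == 0 else dcg(ranked_ratings, k) / ideal

# Hypothetical ratings, in the order a recommender returned the vocabularies.
ranked = [1, 2, 0, 0, 1]
# All ratings in the judged pool for the same selection.
judged = [2, 1, 1, 0, 0]

print(ndcg(ranked, judged, k=1))
print(ndcg(ranked, judged, k=5))
```

With this variant, NDCG@1 is simply the rating of the top result divided by the best available rating.
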
  30. Gold standard. 739 assessments: 7.85% closely related, 10.55% somewhat related, 81.60% unrelated. Agreement between experts: 80%, or 91% when “closely related = somewhat related = related”.
  31. Evaluation results: individual measures. 56.88% isolated vocabularies in $G_E$. 37.45% uninstantiated vocabularies.
  32. Evaluation results: combinations of measures.
  33. Relatedness vs. popularity. NDCG@1 vs. number of pay-level domains instantiating it.
  34. Outline. Data set; Vocabulary relatedness; Post-selection vocabulary recommendation; Conclusions.
  35. Conclusions. Vocabulary-level relatedness: 4 aspects, 6 measures. Empirical analysis: statistical findings, comparison. Post-selection vocabulary recommendation: relatedness-based ranking, popularity-based re-ranking, evaluation. Falcons Ontology Search: http://ws.nju.edu.cn/falcons/ontologysearch/
  36. Take away. Vocabulary meta-descriptions are incomplete. Terms lack labels. Co-instantiated ∝ explicitly related. http://ws.nju.edu.cn/falcons/ontologysearch/
