Semantic Relatedness of Web Resources by XESA - Philipp Scholl

Extended Explicit Semantic Analysis for Calculating Semantic Relatedness of Web Resources

Slide notes:
  • What's different with snippets? Why did they use it?
  • Languages: 29 with more than 1 million articles, categories and administration pages
  • As categories form a different concept space, they cannot be applied directly to the interpretation vector
  • Standard deviation: the square root of the variance
  • Recall is the probability that a relevant document is found; precision is the probability that a found document is relevant.

    1. Extended Explicit Semantic Analysis for Calculating Semantic Relatedness of Web Resources (title slide)
       Presentation at EC-TEL 2010, Barcelona, 2010-10-01 (Philipp Scholl)
       [Figure: recommendation in a knowledge network with WP nodes]
    2. Outline
       • A Learning Scenario: Knowledge Networks and Snippets
       • Measuring Semantic Relatedness with ESA
       • Proposed Enhancements to ESA
       • Evaluation
       • Conclusions & Outlook
    3. Scenario: Crokodil
       • Crokodil: supporting resource-based learning with web resources
         • Collecting fragments of web resources ("snippets")
         • Organizing snippets via (semantic) tagging (with types Person, Event, Goal, Location, ...)
         • Underlying structure: personal and community knowledge networks
       • Embedded as an add-on in the sidebar of the Firefox web browser
    4. Study Results: Snippets of Web Resources
       • Participants of the study [SBB09] found saving fragments of web resources (instead of whole web pages) very useful
       • Snippets ≡ fragments of web resources
         • Definite, narrow topical scope
         • Cover the user's information needs
       • Findings of the study [SBB09]
         • Comparison of 1357 snippets vs. 705 web resources
         • Snippets: 70% are shorter than 100 words
         • Web resources: 70% are shorter than 1000 words
       [SBB09] Scholl, P., Benz, B. F., Böhnstedt, D., Rensing, C., Schmitz, B., Steinmetz, R. (2009): Implementation and Evaluation of a Tool for Setting Goals in Self-Regulated Learning with Web Resources. In: Learning in the Synergy of Multiple Disciplines, EC-TEL 2009, pp. 521-534, Springer-Verlag, Berlin/Heidelberg
    5. Structural Recommendations
       • Suggesting related resources in Crokodil, based on the structure of the knowledge network
         • Whether the resource has already been saved in the personal or community knowledge network
         • Based on explicit connections between the current web resource and tags
       [Figure: recommendation linking a blog entry ("Visualization of Learning with Web 2.0") and a paper excerpt ("Social Network Analysis and Visualizations for Learning") via tags such as Web 2.0, lifelong learning, EC-TEL 2010, E-Learning, TEL]
    6. Challenge: Sparse Knowledge Networks
       • Direct, explicit connections do not always exist
       • Knowledge networks are sparse
       • Goal: semantic recommendation based on snippets; some measure of similarity / relatedness between snippets is needed for recommendation
       [Figure: a blog entry ("e-learning in Web 2.0") and a paper excerpt ("Web 2.0 for learning") without an explicit connection in the tag network]
    7. Implications for Recommending Snippets
       • Snippets
         • Are mostly short
         • Have only a few significant terms
         • The learning scenario needs recommendation of related, not necessarily similar, snippets: semantic relatedness vs. semantic similarity
       • Challenge: vocabulary gap
         • Different wording and terminology
         • Only marginally similar in terminology, but semantically strongly related
         • Example: "TEL refers to the assistance of activities in knowledge acquisition through technology" vs. "E-Learning comprises all forms of electronically supported learning and teaching."
       • A naive bag-of-words approach is therefore not feasible for comparison (see the illustration below)
       • One approach that accommodates these properties: Explicit Semantic Analysis
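To make the vocabulary gap concrete, here is a small, self-contained illustration (not from the slides): a plain bag-of-words cosine between the two example sentences above is close to zero, even though the texts are strongly related. All function and variable names are ours.

```python
# Bag-of-words cosine for the two example sentences from the slide:
# they share almost no terms, so the similarity is close to 0 even
# though the texts are semantically strongly related.
import re
from collections import Counter
from math import sqrt

def bow(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

s1 = "TEL refers to the assistance of activities in knowledge acquisition through technology"
s2 = "E-Learning comprises all forms of electronically supported learning and teaching"
print(cosine(bow(s1), bow(s2)))  # ~0.08: near zero despite strong relatedness
```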
    8. Outline
       • A Learning Scenario: Knowledge Networks and Snippets
       • Measuring Semantic Relatedness with ESA
       • Proposed Enhancements to ESA
       • Evaluation
       • Conclusions & Outlook
    9. Base Approach: Explicit Semantic Analysis
       • Calculates relatedness between words / texts [GM07]
         • Based on a reference corpus containing semantically distinct documents
         • Allows comparison between conceptualized abstractions of documents
       • Pipeline: preprocessing (tokenization, stemming, TF-IDF calculation) maps a document d to a |terms|×1 vector; multiplying it with the n×|terms| semantic interpretation matrix M_int, built from the n corpus documents (n vectors of size 1×|terms|), yields the n×1 semantic interpretation vector i_esa
       • The resulting semantic interpretation vector i_esa can be compared to other documents' vectors (e.g. by the cosine measure)
       [GM07] Gabrilovich, E. & Markovitch, S. (2007): Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 6-12
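A minimal sketch of plain ESA along the lines of [GM07], assuming a tiny in-memory list of concept documents as a stand-in for Wikipedia articles and using scikit-learn for the TF-IDF preprocessing; the names esa_vector, esa_relatedness and concept_docs are illustrative and not taken from the slides or the paper.

```python
# Sketch of ESA: each corpus document is one semantic concept; a text is
# mapped to its TF-IDF vector and then projected onto the concept space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concept_docs = [                                  # toy stand-ins for Wikipedia articles
    "technology enhanced learning e-learning education",
    "general relativity gravitation physics",
    "java programming language software",
]
vectorizer = TfidfVectorizer(stop_words="english")
M = vectorizer.fit_transform(concept_docs)        # |concepts| x |terms| interpretation matrix

def esa_vector(text):
    """Map a text to its semantic interpretation vector over the concept space."""
    x = vectorizer.transform([text])              # 1 x |terms| TF-IDF vector
    return x @ M.T                                # 1 x |concepts| interpretation vector

def esa_relatedness(text_a, text_b):
    """Cosine similarity between the two interpretation vectors."""
    return cosine_similarity(esa_vector(text_a), esa_vector(text_b))[0, 0]

print(esa_relatedness("technology for learning", "e-learning and education"))
```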
    10. Wikipedia as Reference Corpus
        • ESA commonly uses Wikipedia as a feasible reference corpus
        • Wikipedia:
          • Collaboratively edited encyclopedic knowledge (German Wikipedia: 1 million articles)
          • Each article corresponds to a semantic concept (topic)
          • Articles are densely interconnected by wiki links (German Wikipedia: 25 million links)
          • Articles are semantically grouped into categories (German Wikipedia: 122k categories)
          • Articles are connected to corresponding / similar articles in other languages (266 languages available)
        Source: wikipedia.org
    11. Observation and Hypothesis
        • Observation:
          • ESA only considers the article text
          • It ignores semantic information contained in Wikipedia that could be used:
            • Connectivity by links
            • Category information
        • Therefore: implement different enhancements by semantic enrichment: eXtended Explicit Semantic Analysis (XESA)
        • Hypothesis:
          • Semantically enriching the interpretation vector with this additional information, readily provided by Wikipedia, improves the task of comparing snippets
    12. Outline
        • A Learning Scenario: Knowledge Networks and Snippets
        • Measuring Semantic Relatedness with ESA
        • Proposed Enhancements to ESA
        • Evaluation
        • Conclusions & Outlook
    13. XESA: Overview
        • ESA: article content only
        • XESA_AG: article content + article graph
        • XESA_CAT: article content + category information
        • XESA_AG+CAT: article content + article graph + category information
    14. Article Graph Extension
        • Enrichment: the article graph matrix A (|articles|×|articles|) is multiplied with the semantic interpretation vector i_esa (|articles|×1), yielding i_esa_AG (|articles|×1)
        • Additional factors (not shown here):
          • Article-link weight w_link-weight: determines the weight of the article graph
          • selectBestN: selection of only the n best values of i_esa, for complexity reduction
        [Figure: article-link neighborhood of "General Relativity" (Albert Einstein, Gravitation, Space, Matter, Curvature, Black Hole) and of "Albert Einstein" (Catholic School, Jewish, Ulm)]
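A rough sketch of the article-graph enrichment, assuming A is an |articles| × |articles| link adjacency matrix and i_esa a dense NumPy vector; the selectBestN pruning and the way w_link-weight combines the original and the link-propagated activations are simplifications, so the paper's exact weighting may differ.

```python
# Sketch of XESA_AG: prune the ESA vector to its n strongest concepts, then
# spread activation along Wikipedia article links via the adjacency matrix A.
import numpy as np

def xesa_ag_vector(i_esa, A, w_link_weight=0.5, select_best_n=25):
    """Enrich an ESA vector (shape: |articles|) with article-link information."""
    # selectBestN: keep only the n best values of i_esa for complexity reduction
    pruned = np.zeros_like(i_esa)
    top = np.argsort(i_esa)[-select_best_n:]
    pruned[top] = i_esa[top]
    # add activation flowing over article links, scaled by the link weight
    return pruned + w_link_weight * (A @ pruned)
```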
    15. Category Graph Extension
        • As the categories are appended to the concept space, the resulting interpretation vector has more dimensions
        • Enrichment: the category graph matrix (|categories+articles|×|articles|) is multiplied with the semantic interpretation vector i_esa (|articles|×1), yielding the extended vector i_esa_CAT (|categories+articles|×1)
        [Figure: category neighborhood of "General Relativity", e.g. Fundamental Physics Concepts, Relativity, Theories of Gravitation, Physics Concepts by Field, Frames of Reference, with further member articles such as Misner Space, Anti-Gravity, Atom, Heat, Concepts of Heaven]
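A corresponding sketch for the category extension, assuming C is a |categories| × |articles| membership matrix; concatenating the category activations to the article activations mirrors the slide's point that the extended vector lives in a larger concept space. The construction of C and any normalization are left open.

```python
# Sketch of XESA_CAT: project article activations onto categories and append
# them, so the vector covers the combined article + category concept space.
import numpy as np

def xesa_cat_vector(i_esa, C):
    """
    i_esa : shape (|articles|,)               ESA interpretation vector
    C     : shape (|categories|, |articles|)  article-to-category membership matrix
    Returns a vector of length |articles| + |categories|.
    """
    category_part = C @ i_esa                      # activation of the categories
    return np.concatenate([i_esa, category_part])  # extended concept space
```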
    16. Outline
        • A Learning Scenario: Knowledge Networks and Snippets
        • Measuring Semantic Relatedness with ESA
        • Proposed Enhancements to ESA
        • Evaluation
        • Conclusions & Outlook
    17. Evaluation: Development of Our Own Corpus
        • 12 participants were asked to answer questions with snippets
        • Task: find snippets answering 10 different questions of 5 flavors
          • Facts ("What is the FTAA?")
          • Opinions ("Is the term 'dark ages' justifiable?")
          • Homonyms ("What is Java?")
          • Loosely coupled topics ("How are sweets produced?")
          • Wide topics ("What is the origin of the human race?")
          • Plus sub-groups where the meaning is ambiguous (e.g. the Java programming language vs. the Indonesian island Java)
        • Different search engines were used (Google, Bing, Yahoo!, ...), resulting in 282 distinct snippets
        • Note: the created corpus matches our definition of snippets (average 95 terms, min 5, max 756, standard deviation 71.3)
    18. Evaluation: Methodology
        • Evaluation: ESA vs. XESA
          • As we do not have pairwise comparisons for all snippets, the rank is important: relevant and similar snippets should be delivered first
        • Evaluation methodology: break-even point, as used for search engines
          • Definition: the break-even point is the value at which precision and recall of a query are equal; the higher, the better
          • Average interpolated precision is the average over all comparisons of all snippets
          • Displayed as a precision-recall diagram
        • Baseline ESA: break-even point at 0.595
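For clarity, a small illustrative computation of the break-even point for a single ranked result list; the relevance labels in the example are made up and do not correspond to the evaluation corpus.

```python
# Break-even point: walk down the ranking; precision falls while recall rises,
# and the value where the two curves meet is returned.
def break_even_point(ranked_relevance, total_relevant):
    """ranked_relevance: 1/0 relevance labels in ranked order."""
    hits = 0
    for k, relevant in enumerate(ranked_relevance, start=1):
        hits += relevant
        precision = hits / k
        recall = hits / total_relevant
        if precision <= recall:        # curves cross at this rank
            return precision
    return hits / len(ranked_relevance)

# example: 3 relevant snippets exist, the ranking finds two of them early
print(break_even_point([1, 0, 1, 0, 0], total_relevant=3))  # 0.667
```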
    19. Evaluation: Comparing Approaches
        • Selected parameters (adjusted experimentally)
          • selectBestN: n = 25
          • Article-link weight: w ∈ {0.5, 0.75} does not make a significant difference
        • Best results
          • XESA_AG(B) (0.643), but no significant difference from XESA_AG(A) (0.641)
          • About 9% better than ESA
          • XESA_CAT (0.620) is good, but cannot catch up
          • XESA_AG+CAT (0.543) performs worse than ESA
    20. Outline
        • A Learning Scenario: Knowledge Networks and Snippets
        • Measuring Semantic Relatedness with ESA
        • Proposed Enhancements to ESA
        • Evaluation
        • Conclusions & Outlook
    21. Recommending via Semantic Relatedness
        [Figure: recommendation based on semantic relatedness (XESA) between a blog entry ("Visualization of Learning with Web 2.0") and a paper excerpt ("Social Network Analysis and Visualizations for Learning"), connected via tags (Web 2.0, lifelong learning, E-Learning, TEL) and WP nodes]
    22. Conclusions and Future Work
        • Using Wikipedia as a reference corpus for calculating semantic relatedness of snippets is feasible
          • Enhancing ESA by integrating Wikipedia's rich semantic structure yields better results
            • The article graph improves ESA by up to 9%
          • Performance: not yet applicable to online scenarios
        • Future work:
          • Next step: use semantic relatedness in recommendations
          • Coping with large datasets: make the approach performant in real-life contexts
          • Calculate a cut-off for "good" concept terms (dimension reduction)
          • Measure similarity between documents in different languages
    23. Questions? ... Thank you for your attention!
        This work was supported by funds from the German Federal Ministry of Education and Research under the mark 01 PF 08015 A and from the European Social Fund of the European Union (ESF).
