SE@M 2010: Automatic Keywords Extraction - a Basis for Content Recommendation


Presented at SE@M 2010 - Fourth International Workshop on Search and Exchange of e-le@rning Materials (SE@M’10). 27-28 September 2010, Barcelona, Spain


  1. Automatic Keywords Extraction – A Basis for Content Recommendation
     Ivana Bosnić, Katrien Verbert, Erik Duval
     University of Zagreb, Croatia / Katholieke Universiteit Leuven, Belgium
  2. Today...
     - General idea and use case
     - Evaluation of services
     - Evaluation in the environment
  3. Part 1: General idea and use case
  4. Content authoring
     - Lacking inspiration
       - relevant content you should take a look at
     - Educational context
       - What is this content for?
       - Who is making it?
     - Hard to reuse
       - referencing
       - copy-pasting
  5. Use case
     [Diagram: the author's current content, its course context, and metadata feed into tools that recommend available content from a broader context (LORs, Wikipedia, blogs, LMS) for reuse via integration and referencing]
  6. Current work...
     - Web application
       - WikiMedia / WikiPres integration
     - Recommending content (the flow is sketched below)
       - keywords: Zemanta API
       - Wikipedia, blogs
       - GLOBE repository (REST interface)
     - Context
       - presentation slides
     - Reuse
       - referencing the content
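A minimal sketch of this keywords-to-recommendations flow, in Python. The extractor callable is passed in generically, and the repository search URL and parameter names are hypothetical placeholders; GLOBE's actual REST interface is not reproduced here.

```python
import requests

# Hypothetical repository search endpoint -- a placeholder, not GLOBE's real URL.
REPOSITORY_SEARCH_URL = "https://example.org/globe/search"

def recommend(slide_text, extract_keywords, max_results=10):
    """Turn the text being authored into repository recommendations."""
    # 1. Extract ranked keywords from the current content (e.g. via Zemanta).
    keywords = extract_keywords(slide_text)
    # 2. Build a search query from the top-ranked keywords.
    query = " ".join(keywords[:5])
    # 3. Query the learning-object repository over REST.
    resp = requests.get(REPOSITORY_SEARCH_URL,
                        params={"q": query, "limit": max_results},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()  # candidate learning objects the author can reference
```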
  7. Keywords
     - Basis for generating search terms
     - Two groups of generator services:
       - term extraction
         - Yahoo Term Extraction Web Service, FiveFilters
       - semantic entity generation
         - Zemanta, OpenCalais, Evri, AlchemyAPI
  8. Keyword extractors – a closer look
     - Yahoo Term Extraction Service
       - up to 20 keywords, all found in the text itself
       - no ranking
       - generates part of the metadata in GLOBE (SAmgI)
     - Zemanta (called as sketched below)
       - up to 8 keywords, not necessarily present in the text
       - relevance ranking
       - also recommends images, links, blogs/news
       - supports emphasising words to influence the extraction
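A sketch of a Zemanta call, following the general shape of Zemanta's public REST API of that era. The service has since been discontinued, so the endpoint, method name, and response field names below are stated from memory and should be treated as assumptions, not a verified client.

```python
import requests

# Zemanta REST endpoint as documented circa 2010 -- an assumption.
ZEMANTA_ENDPOINT = "http://api.zemanta.com/services/rest/0.0/"

def zemanta_keywords(text, api_key):
    """Request up to 8 relevance-ranked keywords for a piece of text."""
    resp = requests.post(ZEMANTA_ENDPOINT, data={
        "method": "zemanta.suggest",  # assumed method name
        "api_key": api_key,
        "text": text,
        "format": "json",
    }, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Unlike Yahoo Term Extraction, each keyword carries a relevance score
    # ("confidence" is the assumed field name in the JSON response).
    return [(kw["name"], kw["confidence"]) for kw in data.get("keywords", [])]
```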
  9. Part 2: Keyword extractor evaluation
  10. Goals
     - Testing extraction services with existing educational content
     - Comparing Zemanta and Yahoo Term Extraction
     - Learning about user-generated queries
  11. Methodology
     - 6 users
     - 9 presentations found through Google
       - topics: open source, databases, gravity
       - text extracted from 3 adjacent slides
       - different content properties (general, specific, full sentences, bullets...)
     - Users create the queries
     - Users grade the generated keywords
  12. Automatically extracted keywords: user grading I
     [Chart: average keyword relevancy grade, per presentation]
  13. Automatically extracted keywords: user grading II
     [Chart: average keyword relevancy grade, per topic]
  14. Automatically extracted keywords: user grading III
     [Chart: average user grade per Zemanta rank]
  15. User-generated keywords
     - Comparing the differences between user and automatic generation
     - Chosen keywords: those common to 2+ users
     - Two comparisons (sketched below):
       - exact match
       - similar match
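A sketch of the two comparisons. The exact match is plain case-insensitive equality; the "similar match" below uses character-overlap similarity so that near-identical forms like "database" and "databases" count as a match. The threshold and similarity measure are assumptions, as the evaluation's actual similarity criterion is not on the slide.

```python
import difflib

def normalize(kw):
    return kw.strip().lower()

def is_similar(a, b, threshold=0.85):
    # High character overlap counts as a "similar match" -- the measure
    # and threshold are assumptions, not taken from the evaluation.
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def match_counts(user_kws, auto_kws):
    """Count exact and similar matches between user and automatic keywords."""
    users = {normalize(k) for k in user_kws}
    autos = {normalize(k) for k in auto_kws}
    exact = users & autos
    similar = {(u, a)
               for u in users - exact
               for a in autos - exact
               if is_similar(u, a)}
    return len(exact), len(similar)

# Users agreed on "databases"; the extractor proposed "database".
print(match_counts({"open source", "databases"},
                   {"open source", "database"}))   # -> (1, 1)
```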
  16. User-generated keywords
     [Chart: matches between user-generated and automatically generated keywords]
  17. Lessons learned
     - Zemanta won 
       - not an extensive evaluation, though 
     - Presentations were prepared beforehand
       - additional problems occur on-the-fly
  18. Part 3: Evaluation in the environment
  19. Goals
     - Analyzing the use of keywords as the basis for recommendations while the content is being authored
     - Evaluating the relevancy of keywords
     - Part of a usability evaluation
  20. Methodology
     - 4 users
     - Task: make a presentation on a programming topic
       - topics: HTML (x2), databases, XML
     - Users rank the 5 best of the 8 proposed keywords (agreement with the internal ranking is sketched below)
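A sketch of how such rankings can be compared: overlap between each user's top 5 and Zemanta's internal top 5, plus the average user rank per proposed keyword. Scoring an unpicked keyword as rank 6 (one worse than the last explicit rank) is an assumption, not stated on the slide; the example data is hypothetical.

```python
from collections import defaultdict

def top5_overlap(zemanta_ranked, user_top5):
    """How many of a user's 5 picks fall in Zemanta's internal top 5?"""
    return len(set(zemanta_ranked[:5]) & set(user_top5))

def average_user_ranks(user_rankings, proposed):
    """Average rank (1 = best) of each proposed keyword across users.

    Keywords a user did not pick are scored as rank 6 -- an assumption.
    """
    totals = defaultdict(float)
    for top5 in user_rankings:
        rank_of = {kw: i + 1 for i, kw in enumerate(top5)}
        for kw in proposed:
            totals[kw] += rank_of.get(kw, 6)
    return {kw: totals[kw] / len(user_rankings) for kw in proposed}

# Hypothetical data: 8 proposed keywords (best first) and two users' top-5 picks.
proposed = ["html", "css", "web", "browser", "tag", "dom", "script", "form"]
users = [["html", "css", "tag", "dom", "web"],
         ["html", "web", "dom", "script", "css"]]
print(top5_overlap(proposed, users[0]))   # -> 4 of the picks are in the top 5
print(average_user_ranks(users, proposed))
```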
  21. Environment
     - Wiki + slide editing -> WikiPres
  22. Evaluation I
     [Charts: the relation between user and internal ranking; the average of user rankings]
  23. Evaluation I - problems
     - content cold start
     - semantic relation of words
     - unnecessary text markup
     - ambiguity
  24. Evaluation I - changes
     - Including the content from previous slides
     - Slide title emphasis
     - Text cleaning
     (a preprocessing sketch combining the three changes follows)
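A minimal preprocessing sketch, assuming MediaWiki-style markup and assuming that title emphasis is done by simply repeating the title in the extraction text; the mechanism actually used in WikiPres is not on the slide.

```python
import re

def clean_wiki_markup(text):
    """Strip markup that confused the extractor (the Evaluation I problem)."""
    text = re.sub(r"'{2,}", "", text)                             # ''italic'', '''bold'''
    text = re.sub(r"\[\[([^|\]]*\|)?([^\]]*)\]\]", r"\2", text)   # [[target|label]] -> label
    text = re.sub(r"={2,}", "", text)                             # == headings ==
    return re.sub(r"\s+", " ", text).strip()

def build_extraction_text(title, slide_text, previous_slides, title_boost=3):
    """Assemble the text sent to the keyword extractor.

    Repeating the title `title_boost` times is one simple way to emphasise
    it (an assumption). Including previous slides mitigates the content
    cold start on a freshly started slide.
    """
    parts = [title] * title_boost
    parts += [clean_wiki_markup(s) for s in previous_slides]
    parts.append(clean_wiki_markup(slide_text))
    return " ".join(parts)
```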
  25. Evaluation II
     - The same evaluation methodology
     - Additional goal: analyzing the influence of text scenarios
       - including an example
       - changing the sub-topic of the presentation
       - a more general topic (open source)
  26. Evaluation II
     [Charts: the relation between user and internal ranking; the average of user rankings]
  27. Lessons learned
     - The majority of best-ranked keywords appear in Zemanta's top 5
     - Scenario problems:
       - the example: a banking example used for database systems
       - open source: lower ranking
       - dynamic HTML: relation to the general topic
     - Few words per slide
       - slides were created for evaluation purposes
  28. All in all...
     - Zemanta fits the intended purpose
     - The 5 highest-ranked keywords can be used
     - TODO:
       - keyword classification schemes?
       - folksonomies?
       - context information extraction and mapping?
  29. Some hard questions...
     - Will the keywords be found in metadata?
     - Do more relevant keywords produce more relevant recommendations?
     - How do we avoid omitting relevant content?
  30. Thanks 
     ivana.bosnic at fer.hr