Evaluation Result

EVALUATION
Evaluation form:
Evaluation criteria: Modified Average Precision
Example: AveP = (0 + 1/2 + 2/3) / 3 = 0.389

Input page: http://en.wikipedia.org/wiki/Natural_language_processing
  Recommended page                                   Relevance
  http://research.microsoft.com/jump/50176           0
  http://nlp.stanford.edu/                            1
  http://www.aaai.org/aitopics/html/natlang.html      1
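The slide gives only the worked example, so the sketch below shows one way that Modified Average Precision could be computed for a ranked list of relevance judgments. The function name and the choice to let irrelevant ranks contribute 0 to the average are assumptions inferred from the example, not something stated on the slide.

```python
# Minimal sketch of the Modified Average Precision illustrated above,
# assuming irrelevant ranks contribute 0 and the average is over all ranks.

def modified_average_precision(relevance):
    """relevance: list of 0/1 judgments for the recommended pages, in rank order."""
    precisions = []
    hits = 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
        else:
            precisions.append(0.0)           # irrelevant rank contributes 0
    return sum(precisions) / len(precisions) if precisions else 0.0

# Worked example from the slide: ranks 1..3 judged 0, 1, 1
print(modified_average_precision([0, 1, 1]))  # (0 + 1/2 + 2/3) / 3 ~= 0.389
```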
TEST DATA SELECTION
Input pages from 5 topics:
  “Harry Potter”
  “Waterboarding”
  “Wei Chen@CMU homepage”
  “Entropy (thermodynamics)”
  “How to make Sushi”
Dimensions:
  Popular vs. Unpopular (“Harry Potter”, “Wei Chen”)
  Ambiguous vs. Unambiguous (“Entropy”, “Sushi”)
  New vs. Old (“Waterboarding”, “Entropy”)
  Procedural vs. Conceptual (“How to”, “Entropy”)
  Technological vs. Mass media (“Entropy”, “Harry Potter”)
TEST DATA
We evaluate on 5 topics and 3 algorithms, giving a total of 15 categories. Each category has 5 recommended web pages. We have a total of 5 evaluators, each of whom scored 75 web pages.
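A quick arithmetic check of these counts (no project-specific data involved; the 375-judgment total is simply 5 evaluators times 75 pages):

```python
# Sanity check of the evaluation setup: 5 topics x 3 algorithms = 15 categories;
# 5 recommended pages per category = 75 pages per evaluator;
# 5 evaluators = 375 relevance judgments overall.
topics, algorithms, pages_per_category, evaluators = 5, 3, 5, 5
categories = topics * algorithms                        # 15
pages_per_evaluator = categories * pages_per_category   # 75
total_judgments = pages_per_evaluator * evaluators      # 375
print(categories, pages_per_evaluator, total_judgments)
```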
AVERAGE PRECISION
AVERAGE ON ALGORITHMS
AVERAGE ON TOPICS
KAPPA
We can achieve very good inter-coder agreement if we revise our scoring criteria. We all seem to agree with Anthony (maybe we should ask him to revise our scoring criteria).
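The slides do not say which kappa statistic was used. As a rough sketch, the following computes Cohen's kappa for two coders on binary relevance judgments; the coder names and judgments are hypothetical, and with 5 evaluators a multi-rater statistic such as Fleiss' kappa would also be an option.

```python
# Sketch of Cohen's kappa for two coders (hypothetical data; the actual
# agreement statistic used in the evaluation is not specified on the slide).
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two equal-length lists of labels."""
    n = len(coder_a)
    # Observed agreement: fraction of items the two coders label identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, from each coder's marginal label distribution.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Hypothetical judgments on 10 recommended pages (1 = relevant, 0 = not).
anthony = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
other   = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(cohens_kappa(anthony, other))  # ~0.58 for this made-up data
```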
CONCLUSION
Topics play an important role in the evaluation results: the more popular and resource-rich the topic, the better the results. The time-sensitive topic has the highest invalid-page rate. At this point we cannot draw any conclusion about our algorithms; only the Structure algorithm seems better, and we do not know what causes the differences in the evaluation results of the three algorithms. To answer this question, we need to design a new experiment that analyzes the query terms generated by the three algorithms, including a control condition that uses human-generated query terms.
