
Philosophy of IR Evaluation, by Ellen Voorhees



  1. The Philosophy of Information Retrieval Evaluation (2001), by Ellen Voorhees
  2. The Author
     • Computer scientist, Retrieval Group, NIST (15 years)
       o TREC, TRECVid, and TAC: large-scale evaluation of technologies for processing natural language text and searching diverse media types
     • Research focus: "developing and validating appropriate evaluation schemes to measure system effectiveness in these areas"
     • Siemens Corporate Research (9 years)
       o factory automation, intelligent agents, agents applied to information access
  3. NIST (National Institute of Standards and Technology)
     • Non-regulatory agency of the U.S. Dept. of Commerce
     • Mission: "promote U.S. innovation and industrial competitiveness [...] enhance economic security and improve our quality of life"
     • Estimated 2011 budget: $722 million
     • Standard Reference Materials (experimental control samples, quality-control benchmarks), election technology, ID cards
     • 3 Nobel Prize winners
  4. Premises
     • User-based evaluation (p.1)
       o better, more direct measure of user needs
       o BUT very expensive and difficult to execute properly
     • System evaluation (p.1)
       o less expensive
       o abstraction of the retrieval process
       o can control variables, which increases the power of comparative experiments
       o provides diagnostic information about system behavior
  5. The Cranfield Paradigm
     • Dominant model for 4 decades (p.1)
     • Cranfield 2 experiment (1960s): first laboratory testing of IR systems (p.2)
       o investigated which indexing language is best
       o design: measure the performance of index languages free from contamination by operational variables
       o aeronautics experts, aeronautics collection
       o test collection: documents, information needs (topics), relevance judgment set
       o assumptions:
         - relevance can be approximated by topical similarity
         - a single judgment set is representative of the user population
         - the lists of relevant documents for each topic are complete
  6. Modern Adaptations to the Cranfield Paradigm
     • Assumptions no longer strictly true; need to decrease noise (p.3)
       o modern collections are larger and more diverse
       o relevance judgments are less complete
     • Adaptations:
       o ranked list of documents for each topic, ordered by decreasing likelihood of relevance
       o effectiveness of a system as a whole computed as an average across topics
       o large number of topics
       o pooling (judging subsets of documents) instead of complete judgments (p.4)
       o assumptions don't need to be strictly true for a test collection to be viable: different retrieval runs are scored and compared on the same test collection
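The ranked-list and average-across-topics adaptations on this slide can be sketched in a few lines. This is an illustrative toy, not code from the paper: it uses average precision as the per-topic measure (one common TREC choice) and made-up topic and document IDs.

```python
def average_precision(ranked_docs, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """Average the per-topic scores into a single score for the system."""
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)

# Hypothetical ranked runs and relevance judgments for two topics.
runs = {"t1": ["d1", "d3", "d2"], "t2": ["d9", "d4"]}
qrels = {"t1": {"d1", "d2"}, "t2": {"d4"}}
print(mean_average_precision(runs, qrels))  # ≈ 0.667
```

Averaging across many topics is what makes per-topic noise (hard topics, imperfect judgments) tolerable when comparing systems.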
  7. How to Build a Test Collection (TREC example)
     • Assemble a set of documents and topics reflective of an operational setting and real tasks (p.4)
       o e.g. law articles for a law library
     • Participants run the topics against the documents
       o each returns its top documents per topic
     • A pool is formed, then judged by relevance assessors
       o runs evaluated using the (binary) relevance judgments
     • Results are returned to participants
     • The relevance judgments turn the documents and topics into a test collection (p.5)
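The pool-forming step above is just a union of run prefixes. A minimal sketch, with hypothetical run data and an illustrative depth of 2 (TREC traditionally pooled much deeper, e.g. the top 100 per run):

```python
def form_pool(runs_for_topic, depth):
    """Union of the top-`depth` documents from each run, for one topic."""
    pool = set()
    for ranked_docs in runs_for_topic:
        pool.update(ranked_docs[:depth])
    return pool

run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d5", "d1"]
print(sorted(form_pool([run_a, run_b], depth=2)))  # ['d1', 'd2', 'd5']
```

Assessors judge only the pooled documents; everything outside the pool is treated as not relevant, which is exactly the incompleteness the next slide discusses.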
  8. Effects of Pooling and Incomplete Judgments
     • Pooling doesn't produce complete judgments (p.5)
       o some relevant documents are never judged
       o documents added later come from lower in the systems' rankings
     • Unjudged relevant documents are skewed across topics (p.6)
       o topics with many relevant documents initially tend to have many more found later
     • What to do?
       o build a deep and diverse pool (p.9)
       o supplement with recall-oriented manual runs
       o opt for a smaller, fair judgment set rather than a larger, biased one
  9. Assessor Relevance Judgments
     • Different judges, working at different times (p.9)
     • Different assessors produce different relevance sets for the same topics (relevance is subjective)
     • TREC: 3 judges per topic (p.10)
     • Overlap < 50%: the assessors genuinely disagreed
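The overlap figure on this slide is an intersection-over-union of the assessors' relevant sets: the number of documents both judged relevant, divided by the number either judged relevant. A toy sketch with hypothetical judgments:

```python
def overlap(rel_a, rel_b):
    """Intersection-over-union of two assessors' relevance judgment sets."""
    if not rel_a and not rel_b:
        return 1.0  # both judged nothing relevant: trivially identical
    return len(rel_a & rel_b) / len(rel_a | rel_b)

judge_1 = {"d1", "d2", "d3", "d4"}
judge_2 = {"d2", "d3", "d5"}
print(overlap(judge_1, judge_2))  # 0.4  (2 shared out of 5 total)
```

An overlap under 0.5 means the assessors disagreed on more documents than they agreed on, which is why the next slide asks whether evaluation results survive that inconsistency.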
  10. Evaluating with Assessor Inconsistency
     • Rank the systems, sorting by the score each system obtains (p.10)
     • Query-relevance sets (qrels): different combinations of assessor judgments per topic
     • Repeat the experiment several times with: (p.13)
       o different measures
       o different topic sets
       o different systems
       o different assessor groups
     • Result: the comparative ranking of systems is stable despite assessor disagreement
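The stability claim is usually quantified with a rank correlation between the system orderings produced by different qrel variants; Voorhees used Kendall's tau for this. A naive O(n²) sketch with hypothetical system names (production code would use a library implementation such as `scipy.stats.kendalltau`):

```python
def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two orderings of the same set of systems."""
    pos_b = {system: i for i, system in enumerate(ranking_b)}
    concordant = discordant = 0
    n = len(ranking_a)
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is concordant if both rankings order it the same way.
            if pos_b[ranking_a[i]] < pos_b[ranking_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# System rankings obtained from two different qrel variants.
by_qrels_1 = ["sysA", "sysB", "sysC", "sysD"]
by_qrels_2 = ["sysA", "sysC", "sysB", "sysD"]
print(kendall_tau(by_qrels_1, by_qrels_2))  # ≈ 0.667 (one swapped pair)
```

A tau near 1.0 across qrel variants means the comparative conclusions ("system A beats system B") hold even though individual assessors disagree.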
  11. Cross-Language Collections
     • More difficult to build than monolingual collections (p.13)
       o separate set of assessors for each language
       o multiple assessors per topic
       o need diverse pools for all languages: minority-language pools are smaller and less diverse (p.14)
     • What to do?
       o coordinate closely for consistency (p.13)
       o proceed with care
  12. Discussion
     • Do laboratory experiments translate to operational settings?
     • Which metrics or evaluation scores are more meaningful to you?
     • Are there other ways to reduce noise and error?