DE Conferentie 2004 Pia Borlund

Speaker notes:
  • It is a fact that IR systems have become more interactive, and consequently they cannot be evaluated without including the interactive seeking and retrieval processes. The foci of IIR system evaluation therefore include all of the user's interaction with the retrieval and feedback mechanisms as well as the retrieval outcome itself. This has created a demand for hybrid evaluation approaches which combine the two main approaches to the evaluation of IIR systems: the system-driven and the user-centred approaches. Nor should we forget the criticisms of the conventional methods when applying them to IIR systems. Users: evaluation is very time consuming and is often conducted by the developer himself; real users ought to participate so that the system's effect on their dynamic needs can be observed, and their participation gives the opportunity to collect cognitively oriented data. Need: the information need is assumed to be static throughout the experiment, which implies that learning and modification are confined to the search statement alone. Relevance: relevance is not just binary and topical; it is known to be a multidimensional concept. The essential reason for employing work task based approaches to the evaluation of IIR systems is that the application of work tasks and the involvement of test persons make it possible to carry out the evaluation in a way that is close to actual information seeking and IR processes, while remaining in a relatively controlled evaluation environment. It is realistic because: (1) the work task simulates a real work task situation; and (2) the potential end-users involved as test persons develop individual and subjective information needs based on the provided work task. Individually, the test persons interactively search, modify, and assess the relevance of the retrieved information objects in relation to their information needs and the underlying work task situation. It provides experimental control because the work task is the same for all of the test persons, which means one can compare search results across systems and/or system components as well as across the group of test persons. Basically, what we want to do is to bring the users to the lab. NEXT SLIDE
  • Anyone who has carried out evaluation of IR systems based on the involvement of human beings knows that this involvement has consequences for how to relate, interpret, and compare the IR performances. In this paper we look into two of the problems caused by the involvement of human beings in the evaluation of IR systems. In the first case we focus on how the human involvement results in a mixture of different types of objective and subjective relevance assessments. The problem here is how to relate the types of relevance in a way that makes it possible to interpret the relationship as an indication of the IR performance for the given types of applied relevance. As the solution to this problem we propose the Relative Relevance measure, which is an associative measure of the agreement between the two applied types of relevance. The second matter we pay attention to concerns the scattered distributions of user-generated relevance assessments. Picture a ranked list of documents which has been judged by a user: the first document is relevant, but not the second and the third; the fourth document is partially relevant; the fifth is relevant; and so on. Seen from the user's perspective, the higher up the engine can place relevant documents, the better the IR system. So we need a positional performance indicator. For this purpose we propose the Ranked Half-Life indicator, which provides information about how far up the ranked list the engine places the relevant documents (a computational sketch is given after these notes). Both of these proposed measures are meant to supplement, not substitute, the traditional performance measures: recall and precision.
  • In our work we refer to work tasks as simulated work task situations. By doing so we simply point to the fact that the application of work tasks in this context is done by simulating situations in which IR-requiring work tasks arise. A simulated work task situation is a short 'cover story' that describes an IR-requiring situation. A simulated work task situation serves two main functions: 1) it triggers and develops a simulated information need by allowing for user interpretations of the situation, leading to cognitively individual information need interpretations as in real life; and 2) it is the platform against which situational relevance is measured. More specifically, the simulated work task situation describes to the test person the source of the information need, the environment of the situation, and the problem which has to be solved, and it serves to make the test person understand the objective of the search. The NEXT SLIDE shows an example of a simulated work task situation.
  • Here we have an example of a simulated work task situation. This one, which we refer to as sim A, is taken from the meta-evaluation. It is the only simulated work task situation which is not generated from the Interactive TREC-6 Topics. We used the TREC Topics because we knew they would correspond to at least parts of the TREC data collection. In the present meta-evaluation we apply two different versions of the work task component. The reason is that we are interested in learning about the consequences (if any) of the level of semantic openness of the simulated work task situations presented to the test persons. This results in a sim-1 version and a sim-2 version. The sim-1 version consists of a simulated work task situation as well as an indicative request. The indicative request is a suggestion to the test person about what to search for; it is not to be seen as an example of the underlying need of the particular simulated work task situation in use. The sim-2 version consists of the simulated work task situation only. The exclusion of the indicative request (and definition) makes the sim-2 version more semantically open.
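
  [A minimal computational sketch of the Ranked Half-Life indicator mentioned in the notes above, written here in Python. In Borlund's work the RHL is based on a half-life (median) calculation over the ranked positions of the relevant documents, by analogy with the citation half-life; the sketch below implements one simple reading of that idea, and the graded scores (1.0, 0.5, 0.0) are illustrative assumptions, not values prescribed by the talk.]

      def ranked_half_life(scores):
          # `scores` holds graded relevance assessments in ranked order.
          # RHL is read here as the rank at which half of the total
          # relevance "mass" of the list has been accumulated.
          total = sum(scores)
          if total == 0:
              return None  # no relevant documents retrieved
          half, cumulated = total / 2.0, 0.0
          for rank, score in enumerate(scores, start=1):
              cumulated += score
              if cumulated >= half:
                  return rank
          return len(scores)

      # The ranked list from the note: relevant, non-relevant,
      # non-relevant, partially relevant, relevant.
      print(ranked_half_life([1.0, 0.0, 0.0, 0.5, 1.0]))  # -> 4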

    1. Pia Borlund, Department of Information Studies, Royal School of Library and Information Science, Denmark, [email_address]. The IIR evaluation model: a framework for evaluation of interactive information retrieval (IIR) systems. DEN conference, Arnhem, December 1-2 2004.
    2. Outline:
       - Motivation for the development of the IIR evaluation model
       - The IIR evaluation model
         - Objective & parts of the IIR evaluation model
           - Part 1: test components
           - Part 2: recommendations
           - Part 3: alternative performance measures
         - Strengths and weaknesses of the IIR evaluation model
       - How can the cultural heritage sector use the IIR evaluation model?
    3. Objective of IR systems evaluation:
       - The primary objective of IR systems evaluation is the measurement of effectiveness
       - ... which is about how well the system retrieves all the relevant documents while at the same time retrieving as few non-relevant documents as possible
       - ... it is a measure of how well the system performs - hence, performance measures
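       [Worked illustration with hypothetical numbers, not taken from the talk: if a search retrieves 10 documents of which 6 are relevant, and the collection contains 20 relevant documents in total, then precision = 6/10 = 0.6 and recall = 6/20 = 0.3.]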
    4. Main approaches to IR research and IR systems evaluation:
       - Simple illustration of the parts/agents involved in the IR process: user with information need -> request (R) -> intermediary (person or interface) -> query (Q) -> IR system (documents/representations)
       - The objective of the IR process is to obtain an appropriate match (harmony) between the involved parts/agents so as to satisfy the user's information need
       - System-driven approach to IR; user-oriented approach to IR; cognitive viewpoint
    5. The Cranfield model:
       - The Cranfield model derives directly from the Cranfield II experiment:
         - the 'principle of test collections' (a collection of documents; a collection of queries; and a collection of relevance assessments)
         - recall/precision
       - Test characteristics:
         - controlled laboratory test, no user participation
         - (requests)/queries = information need
         - batch mode (1 search run) = static information need and relevance
         - objective, topical, and binary relevance
         - experimental control
    6. System-driven vs. user-oriented approach:
       "The conflict between laboratory and operational experiments is essentially a conflict between, on the one hand, control over experimental variables, observability, and repeatability, and on the other hand, realism." (Robertson & Hancock-Beaulieu, 1992, p. 460)
    7. Motivation for the IIR model:
       - Facts:
         - IR systems have become more interactive!!
         - Criticisms against the conventional methods, e.g.:
           - real end-users are rarely involved
           - the information need is assumed static throughout the experiment
           - binary and topical relevance types are used
       - Solution:
         1) The application of simulated work task situations:
            - realistic information searching & retrieval processes
            - experimental control
         2) To measure performance by use of non-binary based performance measures:
            - realistic assessment behaviour
            - indication of users' subjective impression of system performance and satisfaction of information need
    8. Outline (section divider; same as slide 2).
    9. Objective of the IIR evaluation model:
       The aim of the IIR evaluation model is two-fold:
       - to facilitate evaluation of IIR systems as realistically as possible with reference to actual information searching and retrieval processes, though still in a relatively controlled evaluation environment; and
       - to calculate the IIR system performance taking into account the non-binary nature of the assigned relevance assessments and respecting the different types of relevance
    10. Parts of the IIR evaluation model:
        The model consists of 3 parts:
        1) a set of test components which aims at ensuring a functional, valid, and realistic setting for the evaluation of IIR systems;
        2) empirically based recommendations for the application of the proposed sub-component, the concept of simulated work task situations; and
        3) alternative performance measures, e.g.:
           - the measure of Relative Relevance (RR)
           - the performance indicator of Ranked Half-Life (RHL)
           - the measure of Cumulative Gain (CG)
           - the measure of Cumulative Gain with Discount (DCG)
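        [The two gain-based measures named on this slide can be sketched in a few lines of Python, following the formulation in Järvelin & Kekäläinen (2000), which is listed in the references. The graded gain values and the log base of 2 used for the discount are illustrative assumptions, not prescriptions from the talk.]

            import math

            def cumulated_gain(gains):
                # CG at rank i is simply the sum of the graded gains down to rank i.
                cg, total = [], 0.0
                for g in gains:
                    total += g
                    cg.append(total)
                return cg

            def discounted_cumulated_gain(gains, base=2):
                # DCG divides the gain at rank i by log_base(i) for i >= base,
                # so relevant documents retrieved late in the ranking count less.
                dcg, total = [], 0.0
                for i, g in enumerate(gains, start=1):
                    total += g if i < base else g / math.log(i, base)
                    dcg.append(total)
                return dcg

            gains = [3, 0, 1, 2, 0]                  # graded relevance of a ranked list (illustrative)
            print(cumulated_gain(gains))             # [3.0, 3.0, 4.0, 6.0, 6.0]
            print(discounted_cumulated_gain(gains))  # [3.0, 3.0, ~3.63, ~4.63, ~4.63]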
    11. Outline (section divider; same as slide 2).
    12. Part 1: test components:
        - the involvement of potential users as test persons;
        - the application of individual and potentially dynamic information need interpretations deriving from, e.g., simulated work task situations; and
        - the assignment of multidimensional and dynamic relevance judgements
    13. Part 2: recommendations ... for the application of simulated work task situations:
        - to employ both simulated and real information needs within the same test
        - to tailor the simulated work task situations toward the test persons with reference to:
          - a situation the test persons can relate to easily and with which they can identify themselves;
          - a situation that the test persons find topically interesting; and
          - a situation that provides enough imaginative context for the test persons to be able to apply the situation
        - to permute the order of search jobs between the test persons
        - to pilot test prior to actual testing
    14. Part 3: alternative performance measures ... motivation:
        - Evaluation of IIR systems involves human beings, which leads to:
          - a mixture of different types of objective and subjective relevance assessments
          - the assignment of non-binary relevance assessments
          - scattered distributions of user-generated relevance assessments
        - Relative Relevance (RR) measure: associative measure of agreement between types of relevance
        - Ranked Half-Life (RHL) indicator: positional indicator of ranked retrieval results
        - ... the proposed measures are to supplement, not substitute, recall and precision
    15. Strengths and weaknesses of the IIR evaluation model:
        - Strengths:
          - realism
          - IR + searching behaviour
          - real + simulated information needs
          - subjective, non-binary, potentially dynamic relevance
          - alternative performance measures + recall/precision
          - experimental control
          - repeatable, but not necessarily with identical results
        - Weaknesses:
          - resource demanding (manpower + time)
          - requires domain knowledge
          - requires design and test of simulated work task situations
          - lack of comparability of performance measure results, due to the subjective assessments
    16. Outline (section divider; same as slide 2).
    17. How can the cultural heritage sector use the IIR evaluation model?
        - Investigation of information seeking and searching behaviour in the cultural heritage domain by use of simulated work task situations
        - Specification of requirements for new systems by use of simulated work task situations based on information needs and information seeking/searching behaviour
        - Performance and/or usability tests of existing systems -- the complete model, or parts of the model
    18. Thank you!!
    19. References:
        - Borlund, P. (2000). Evaluation of interactive information retrieval systems. Åbo: Åbo Akademi University Press. Doctoral thesis.
        - Cleverdon, C.W. and Keen, E.M. (1966). Aslib Cranfield research project: factors determining the performance of indexing systems. Vol. 2: results. Cranfield.
        - Cleverdon, C.W., Mills, J. and Keen, E.M. (1966). Aslib Cranfield research project: factors determining the performance of indexing systems. Vol. 1: design. Cranfield.
        - Järvelin, K. and Kekäläinen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000, pp. 41-48. New York, NY: ACM Press.
        - Robertson, S.E. and Hancock-Beaulieu, M.M. (1992). On the evaluation of IR systems. Information Processing & Management, 28(4), pp. 457-466.
    20. Contingency table – performance measures:
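        [The table itself is not reproduced in this transcript. For reference, the standard IR contingency table from which recall and precision are normally derived is:

                            Relevant        Non-relevant
            Retrieved       a (hits)        b (noise)
            Not retrieved   c (misses)      d (rejected)

        Precision = a / (a + b); Recall = a / (a + c); Fallout = b / (b + d).]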
    21. Definition of simulated work task situation:
        - a short 'cover story' that describes a situation which leads to IR
        - ... it serves 2 main functions; it:
          - triggers the simulated information need
          - is the platform against which situational relevance is assessed
        - ... more specifically, it describes:
          - the source of the information need
          - the environment of the situation
          - the problem which has to be solved, and
          - serves to make the test person understand the objective of the search
        - ... further, by being the same for all the test persons, experimental control is provided
        - ... as such the concept of simulated work task situations ensures the experiment both realism and control
    22. Example of simulated situation / simulated work task situation (Source: Borlund, 2000)
