[Thank you. I know your work, but don’t know you that well.]
Two groups of challenges that I would like us to do something about:
Conceptual: the kinds of concepts we explore, the principled dimensions in our comparisons, the design concepts behind our systems.
Practical: the little details in experiments that matter.
Finally, this sounds like a critique, but we are all in the same boat.
Abstract: Research in information visualization (InfoViz) has developed considerably during the last 25 years. In particular, the field is now informed by a substantial and growing literature on evaluations of visualizations. To keep advancing InfoViz, I believe we need to address two limitations of our evaluations. On the one hand, few empirical studies are motivated by theory or compare equally plausible hypotheses. Mostly, the InfoViz literature proposes radical innovations (in William Newman’s terms) and does little to develop and test concepts. On the other hand, many of the practical, low-level decisions in InfoViz evaluations are problematic. Like most HCI researchers, we evaluate our own interfaces, use mostly simple outcome measures, rarely study the process of interaction, and select tasks somewhat randomly. This talk will outline the conceptual and practical challenges of evaluation and begin a discussion of how to overcome them.
1. More papers with evaluations. Could be more [Lam]. We have gotten much right.
2. Many concerns remain. The BELIV workshops work on these. Munzner.
3. Perhaps these concerns are less special to information visualization than one would think.
4. Usability evaluation has had many methodological discussions. Scorching review by Gray & Salzman; CHI review.
Win-lose evaluations: no plausible alternative. Part of what Greenberg and Buxton objected to: putting evaluation tails on more or less innovative ideas.
Point designs: no mapping out of design spaces; the design and the evaluation stand as a more or less random idea.
Clarity in key concepts: focus+context interfaces vs. fisheye interfaces.
Lack of theory-driven experiments: maybe related to perception, not outside.
[let me give some examples]
1. Read definitions.
2. Many other definitions (e.g., Shneiderman, “The eyes have it”).
3. Quite a confusion surrounding the term.
How is overview used?
[we do not know much about the psychological dimension of overview, nor do we know much about the link between technical and psychological aspects of overviewing]
[lack of data or clear comparisons about key questions]
So what do we know about fisheye interfaces?
Fisheye views in Eclipse
Study the adoption and integration.
1. Psychological issue at stake: the New Yorker’s view of the world?
1. The notion of strong inference.
2. A lot of Platt’s argument is weak.
[why do we not answer such questions?] (a) lack of true alternatives, (b) few interesting, negative results.
Heer: analytic framework. [Weaver: what would be the strong inference test of interaction with coordinated windows?]
The % is of all studies, not only empirical ones. Much lower in engineering.
One characteristic of radical solutions is that they do not seek to develop or improve on existing methods, techniques, or tools.
Let me show you how these studies measure usability. This chart shows the effectiveness measures used in the 104 studies. Effectiveness is typically measured as error rates or binary task completion. Few studies use complex measures of effectiveness, such as assessments of the quality of the outcome of the interaction, or expert assessments.
[effectiveness measures, mainly the boring ones. But we want the more complex ones]
Task completion time is a summary measure of the process. Interpretation is hard.
1. Q&A task.
2. The overview+detail interface is approximately 20% slower (around a minute) than the linear interface. There is no significant difference between the fisheye and the other interfaces.
Summary measures need to be contextualized by qualitative data on the interaction process. Some measures that we treat as outcome measures are really about process.
1. The point here is that we use rich process data to understand the summary measures of the process.
Few replications and a lot of problems…
[two typical questions after having used an interface. Cross out the scale. Compare similar ratings for, say, two prototypes]
We do not know:
- how they relate
- if they measure, say, sense of control, in a valid way
[we can compare this variety of words to established questionnaires. And when we do that, we see that the reliability of questionnaires that people invent themselves is lower]
Task effects are strong, perhaps second only to individual differences.
This will mostly be about experiments. Some points apply to other types of studies. Not to belittle other research approaches. Beyond dogmas (cf. Hornbæk 2010).
BELIV'10 Keynote: Conceptual and Practical Challenges in InfoViz Evaluations
Conceptual and practical challenges in InfoViz evaluations
Kasper Hornbæk
Concerns in evaluation
Increase in InfoViz studies that include evaluation
But concerns over those studies remain, see
- The BELIV workshops
- Reviews of InfoViz papers (e.g., Munzner 2008)
Concerns are shared by research in
- Usability evaluation (e.g., Gray & Salzman, 1998)
- CHI (e.g., Greenberg & Buxton 2008, PC meeting)
Conceptual challenges
Observations about (some) InfoViz studies:
- Win-lose evaluations
- Point designs
- Lack of clarity in key constructs
- Lack of theory-driven experiments
Example: The concept of overview
An overview
- “is constructed from, and represents, a collection of objects of interest” (Greene et al., 2000, p. 381)
- “implies a qualitative awareness of one aspect of some data, preferably acquired rapidly and, even better, pre-attentively: that is, without cognitive effort” (Spence, 2007, p. 19)
Hornbæk & Hertzum (under review)
The concept of overview, cont’d
1391 occurrences of overview in 60 studies analyzed
Used in three senses:
- 81% technical (overview+detail visualization)
- 14% user-centered (user forming an overview)
- 5% other (“give an overview of the literature”)
Does an overview give an overview?
Hornbæk & Hertzum (under review)
Example: Fisheye interfaces
DOI function composed of a priori interest and distance components (see the sketch below)
Mixed empirical results (e.g., Cockburn et al. 2008)
- some optimistic (e.g., papers at this CHI)
- some pessimistic (e.g., Lam & Munzner 2008)
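For readers who do not have the slide’s shorthand in mind: the degree-of-interest (DOI) function behind fisheye interfaces is usually traced to Furnas (1986). A minimal statement of it, in my notation rather than the talk’s, is:

```latex
\mathrm{DOI}(x \mid \mathrm{focus}=y) \;=\; \mathrm{API}(x) \;-\; D(x, y)
```

Here API(x) is the a priori importance of element x (e.g., its nesting depth in a source file) and D(x, y) is its distance from the current focus y; elements whose DOI falls below a threshold are elided or de-emphasized. The question raised on the following slides about a priori interest concerns the API term: whether, in practice, it adds anything beyond the distance component.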
Jakobsen & Hornbæk, “Evaluating a fisheye view of source code”, CHI (2006)
Focus area
Context view
Jakobsen & Hornbæk, “Fisheyes in the field …”, CHI (2009)
Fisheye interfaces, cont’d
Is a priori interest useful?
- Our data suggest maybe
- Hard to understand a priori components of the DOI function
- Appears less useful in real-life programming than lines linked directly to the users’ focus
- But no one (?!) has generated direct evidence about this question
Jakobsen & Hornbæk (under review)
Strong inference
Platt (1964) proposed the notion of strong inference, with the steps of
- Devising alternative hypotheses
- Devising a crucial experiment which will rule out some hypotheses
- Carrying out the experiment
Radical solutions
Newman (1994) reviewed CHI publications and found 25% radical solutions
Simple outcome measures
Hornbæk, K. (2006), “Current Practice in Measuring Usability …”, IJHCS, 64(2), 79-102
Simple process measures
57% of studies measure task completion time
Task completion time works as a summary measure of the process
How should we interpret task completion time?
Hornbæk, K. (2006), “Current Practice in Measuring Usability …”, IJHCS, 64(2), 79-102
Standing on the shoulders of giants
A key challenge is to build on the work of others in selecting:
- Data
- Tasks
- Measures
- Interfaces
Crucial for replications, accumulating knowledge, validity/reliability
Reliability of questionnaires
Homegrown questionnaires have low reliability; 6 are below .7
Hornbæk & Law, CHI (2007)
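The reliability figures on this slide are presumably Cronbach’s alpha, the usual internal-consistency coefficient for multi-item scales, with .7 as the conventional threshold. Below is a minimal sketch of how such a coefficient is computed, using hypothetical rating data rather than anything from Hornbæk & Law (2007):

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Internal-consistency reliability (Cronbach's alpha).

    item_scores: 2-D array; rows are respondents, columns are questionnaire items.
    """
    scores = np.asarray(item_scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the summed scale score
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 5 respondents rating a 3-item "sense of control" scale
ratings = [[4, 5, 4],
           [2, 2, 3],
           [5, 4, 5],
           [3, 3, 2],
           [4, 4, 4]]
print(round(cronbach_alpha(ratings), 2))  # values below .7 are conventionally considered low
```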
Example: Selecting tasks
Tasks are crucial (thanks Catherine)
But chosen ad hoc, to match evaluation, or habitually
Which task would you use to evaluate an overview?
Monitoring rare, navigation frequent (Hornbæk & Hertzum, under review)
What to do?
- More strong, theoretically motivated comparisons (TILCS?)
- More complex measures of outcome and process, coupled with richer data
- Build on existing work, replications
- Task taxonomies, task-level analysis, field work
- Studies of adoption and integration
Construct validity
From Adcock, R. & Collier, D. (2001), Measurement Validity: A shared standard …, American Political Science Review, 95(3), 529-546
Results: are measures correlated?
Effectiveness vs. efficiency: r = .247 ± .059
May be interpreted as:
- 6% variance explained
- small (~.1) to medium (~.3) effect (Cohen 1969)
(additional correlations shown on the slide: r = .229, r = .23)
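As a small aside (my arithmetic, not the slide’s), the “6% variance explained” follows directly from squaring the correlation coefficient:

```latex
r^2 \;=\; 0.247^2 \;\approx\; 0.061 \;\approx\; 6\%
```

That is, knowing effectiveness accounts for only about 6% of the variance in efficiency (and vice versa), which is why the slide characterizes the effect as small to medium.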
Results: are measures correlated?
Task complexity does not influence the correlations
More complex measures (e.g., quality of outcome) attenuate correlations
Difference between errors-along-the-way (.441) and task-completion-errors (.155)