An Evaluation Scheme for Hierarchical Information Browsing Structures

Megan Richardson
New Mexico State University
P.O. Box 30001
Las Cruces, NM 88003-8001
merichar@nmsu.edu

Abstract
There is no widely accepted means of evaluating category systems for information search and browsing. This presentation outlines an evaluation scheme and an evaluation method that applies the scheme. The scheme delineates features broadly classified under comprehensiveness, coherence, and correctness. The method evaluates the category system through a survey distributed among subject domain experts. The method requires minimal resources, is easily conducted remotely, and is easily modified. The approach finds the over- and under-sensitivities of the method of generating the system. A case study has demonstrated the usefulness of the approach, and the inter-rater reliability found suggests that the evaluation scheme is meaningful.

Keywords
Evaluation, search, browsing, navigation, category systems, hierarchies, information architecture, heuristics, usability, design, human factors

ACM Classification Keywords
H.5.2. Information interfaces and presentation: User interfaces – Evaluation/methodology, User-centered design. H.3.3. Information storage and retrieval: Information search and retrieval – Search process.

Copyright is held by the author/owner(s).
CHI 2008, April 5 – April 10, 2008, Florence, Italy.
ACM 978-1-60558-012-8/08/04.
Introduction
This presentation describes an evaluation scheme (or schema) for hierarchical information search and browsing systems, a study method that applies the scheme, and a case study that establishes the reliability of the scheme.

Because of the large volume of digital information available, designers of information browsing systems turn to automation where possible; yet there is no standard method for evaluating the systems that provide this automation. Evaluations can guide development and demonstrate the value of a system to stakeholders. The technique described here allows for quick, inexpensive evaluations retaining the best features of existing evaluation methods.

Background: Category system evaluation
There is no widely accepted means of evaluating category systems for search and browsing. Several metrics have been applied, but each has drawbacks.

Since category system creation has some historical roots in the field of information retrieval (IR), traditional IR evaluation metrics have been applied [e.g., 2]. These include precision (the proportion of relevant items retrieved to the total number of items retrieved), recall (the proportion of relevant items retrieved to the total number of relevant items in the collection), and combinations thereof such as F-measure (the harmonic mean of precision and recall). Such automatic measurements are highly practical in that they can easily be retaken with each iteration of development of the system. Unfortunately, most such measurements have not been definitively correlated with other desirable features of a system, such as usability or preferability, and they often do not provide diagnostic information for use in system development. Furthermore, most of the metrics in this class apply better to clusters than to hierarchies, having been developed for use with document retrieval and similar applications where output can meaningfully be regarded as unstructured bags of words.

Iterative task-oriented usability testing can assess the usefulness of a category system for completion of search tasks. These methods, however, can be costly and time-consuming.

IR metrics are intrinsic metrics taken using automatic methods. Usability criteria are extrinsic metrics taken using empirical methods. Table 1 classifies some category system evaluation metrics in terms of whether the methods are automatic (ad hoc, implemented in software) or empirical (post hoc, involving surveying or testing with representative human participants), and intrinsic (expressing purportedly desirable properties of the system) or extrinsic (assessing the usefulness of the category system in completion of tasks).

A system will have good features and bad features, and it is usually meaningless to average scores across features. The key to evaluation is provision of specific feedback. The method presented here does provide specific feedback, and this feedback has been shown to be useful, presumably because it derives from a meaningful analysis of the relevant features of an information browsing system. The method combines the ease of automatic methods and the validity of usability testing.
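The precision, recall, and F-measure definitions above amount to simple set arithmetic. A minimal Python sketch (function and variable names are mine, not from the paper) illustrates them:

```python
def precision_recall_f(retrieved, relevant):
    """Compute precision, recall, and F-measure for a set of
    retrieved items against the set of truly relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items actually retrieved
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of p and r
    return p, r, f

# Example: 3 of 4 retrieved items are relevant; 6 relevant items exist.
p, r, f = precision_recall_f({"a", "b", "c", "d"},
                             {"a", "b", "c", "e", "f", "g"})
# p = 0.75, r = 0.5, f = 0.6
```

As the guards show, the measures are undefined when nothing is retrieved or nothing is relevant; returning 0.0 in those cases is one common convention.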
Intrinsic Metrics
  Automatic Methods (Comparison to Ideal System):
  • precision and recall
  • average uninterpolated precision [10]
  • similarity as ratio of pairs common to two hierarchies to total pairs in one hierarchy [7]
  • similarity as category distance [14]
  • path length [8]
  Empirical Methods (User Evaluation; includes the present study):
  • percent correct relationship between pairs [12]
  • understandability [11]
  • accuracy [5, 9]
  • preferences [4, 6, 12, 15]

Extrinsic Metrics
  Empirical Methods (Usability Testing):
  • task completion rate
  • time to completion

Table 1. Category system evaluation metrics.
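One of the intrinsic metrics in Table 1, similarity as the ratio of pairs common to two hierarchies to the total pairs in one hierarchy [7], can be illustrated with a short sketch. Assumptions: each hierarchy is represented as a set of direct parent-child edges, and "pairs" is read as those edges (reference [7] may define pairs differently); the function name is illustrative:

```python
def hierarchy_pair_similarity(pairs_a, pairs_b):
    """Ratio of (parent, child) pairs shared by two hierarchies to the
    total number of pairs in the first hierarchy (cf. metric [7])."""
    pairs_a, pairs_b = set(pairs_a), set(pairs_b)
    if not pairs_a:
        return 0.0  # no pairs to compare against
    return len(pairs_a & pairs_b) / len(pairs_a)

# Two small hierarchies as parent->child edge sets.
a = {("food", "fruit"), ("food", "meat"), ("fruit", "apple")}
b = {("food", "fruit"), ("fruit", "apple"), ("fruit", "pear")}
# 2 of a's 3 pairs also appear in b -> similarity 2/3
```

Note the measure is asymmetric: it is normalized by the pair count of the first hierarchy only, so comparing a generated hierarchy against a reference gives a different score than the reverse.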
The evaluation scheme: comprehensiveness, coherence and correctness
The scheme includes evaluators' overall impressions of the extent to which the system captures the important concepts for the collection. This information is collected in the study method described through direct solicitation of participant opinion. The scheme also takes account of specific features of the system: its completeness, correct naming (or labeling) of categories, appropriate differentiation of categories, correct depth and scope of categories, and the coherence and correct placement of subcategories. These features are assessed in the study method by asking participants about specific changes they would make at particular points in the structure. Since responses are categorical, the structure's rating on each feature consists in the rating given by the most participants. Results indicate over- and under-sensitivities of the method used to produce the structure. Table 2 presents the scheme.

A study method
The study technique presented here was used in the test case and was found to be effective. It is important to note, however, that it is but one possible application of the scheme. Traditional heuristic evaluations, usability studies, and card sort evaluations could be based on the scheme. Where gold standard data are available, automatic methods could be based on the scheme.

The study is conducted via an online survey. Participants are shown the category system and asked about the top-level categories. Then participants are asked to look at the second-level categories under each of two headings, and at a pre-selected sample of lower-level categories, and asked about those. Examples of the questions asked are provided in Table 2. (A complete list of questions is available on request.) Participants report subjective ratings on a four-point scale. Specific information is also recorded on changes the participants would make to the category structure.
General Feature: Comprehensiveness
  Specific Feature: Capture of important concepts
    [at top level] [Do] these categories capture the important concepts for this collection[?]
  Specific Feature: Completeness
    [at top level] Would you add any top-level categories?
    [for subcategories] Would you add any subcategories to this category?

General Feature: Coherence
  Specific Feature: Category coherence
    Are there any categories you would split into two, three, or more categories?
  Specific Feature: Differentiation
    Are there any categories you would merge together?

General Feature: Correctness
  Specific Feature: Depth
    [for subcategories] Would you promote any subcategories to top-level?
  Specific Feature: Placement
    [for subcategories] Would you move any subcategories to a different existing top-level category?
  Specific Feature: Correct naming
    [at top level] Are there any categories you would keep, but rename?
  Specific Feature: Scope (exclusion of misfit concepts)
    [at top level] Would you remove any top-level categories?
    [for subcategories] Would you remove any subcategories entirely?

Table 2. The evaluation scheme.
Modes of the scaled responses are found. The comprehensiveness, coherence, and correctness of categories are assessed primarily by looking at the extent to which participants would add, remove, move, or rename categories.

Inter-rater agreement is calculated using Kendall's coefficient of concordance (W). Scales are treated as categorical responses. Significance of difference from a null hypothesis of maximal rating can be determined using chi-square tests. Further between-subjects analyses are possible, but these were conducted in the test case and produced no informative results. Comparative evaluations can be conducted by finding any significant differences in responses to two systems.

Test case and scheme reliability
Castanet
Facets are orthogonal descriptors used to categorize items in a collection. If a facet is hierarchical rather than flat, a label may have other labels beneath it in the structure. The creation of a faceted hierarchical search and browsing system requires that metadata be assigned to items in the collection, specifying their facets. Because such browsing systems are specifically intended for use with large collections, hand-coding of this metadata is costly; hence the need for algorithms that automatically assign metadata. Castanet is an algorithm that assigns facet and hierarchy metadata [13].

The study
Castanet was evaluated using the present scheme and method. Two datasets were used: a list of 3275 journal titles taken from MEDLINE citations, and a collection of 13000 recipes found on web sites. These collections were previously used in Castanet pilots and were found to have fewer ambiguities than other collections [13].

The questionnaire was first tested with two users, an expert in each subject domain. Critiques were solicited of the study materials, especially concerning the clarity and number of questions and answer choices.
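Kendall's W and its large-sample chi-square significance test can be sketched as follows. This is a minimal illustration under simplifying assumptions: raters supply complete rankings (rather than the categorical scale responses the study actually collects) and no tie correction is applied; the function name is mine:

```python
def kendalls_w(rank_matrix):
    """Kendall's coefficient of concordance W for a matrix of ranks,
    rows = raters, columns = items being ranked.  No tie correction
    (a simplification of the analysis described in the text)."""
    m = len(rank_matrix)       # number of raters
    n = len(rank_matrix[0])    # number of items ranked
    col_sums = [sum(row[j] for row in rank_matrix) for j in range(n)]
    mean_sum = m * (n + 1) / 2  # expected column sum under no agreement
    s = sum((cs - mean_sum) ** 2 for cs in col_sums)
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    # Large-sample significance: chi2 = m * (n - 1) * W, with n - 1 df.
    chi2 = m * (n - 1) * w
    return w, chi2

# Three raters ranking four items in perfect agreement.
w, chi2 = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
# w = 1.0 (complete concordance), chi2 = 9.0
```

W ranges from 0 (no agreement) to 1 (complete agreement), and the chi-square statistic is compared against the chi-square distribution with n - 1 degrees of freedom to test whether the observed concordance exceeds chance.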
Since in studies of search it is important to use participants with interest in the subject matter [2], participation was solicited on email discussion lists dedicated to the relevant subjects. In the evaluation under discussion, there were 49 participants. Participants were entered in a drawing for a gift certificate for Amazon.com.

Results and discussion
Results are summarized in Table 3. Mode responses indicate that participants would not make changes. For one database, however, Castanet did poorly in naming and created too many top-level categories. These results led Castanet's creator to focus on the algorithm's handling of synonymy [Stoica, personal communication].

                          Best    Intermediate    Worst
                          rating  ratings         rating
  Overall Impressions
    Important Concepts    *69%     6%    18%       7%
    Intuitiveness         *38%    23%    26%      14%
  Specific Changes
    Completeness          *46%    32%    15%       8%
    Category Coherence    *81%    14%     3%       2%
    Differentiation       *49%    29%    20%       2%
    Correct Depth         *74%    22%     4%       0%
    Correct Placement     *63%    26%     9%       3%
    Correct Naming        *45%    29%    22%       5%
    Scope                 *54%    31%     9%       6%

Table 3. Summary of ratings for Castanet, averaged over biomedical and recipe conditions, rounded to nearest percent. Best rating, e.g., Strongly agree; intermediate ratings, e.g., Agree somewhat and Disagree somewhat; worst rating, e.g., Strongly disagree. Percentage of participants giving the mode response marked with *.

Despite otherwise good ratings, ratings of Castanet's intuitiveness were thinly distributed across the scale. Since usefulness and acceptability do not require that a system match human intuition [1], measures thereof were removed from the scheme based on this finding.

Reliability of the evaluation scheme
Reliability of a coding scheme is based on agreement between annotators [3]. A high (but sub-maximal) inter-rater agreement suggests that the scheme is neither idiosyncratic nor trivial and that raters have some shared understanding of the categories in the scheme. In the test case, inter-rater agreement was significant at W = 0.4309 (χ²(4, N = 49) = 63.7695, p < 0.0001). This inter-rater reliability suggests that the chosen evaluation scheme is meaningful.

Furthermore, the scheme's usefulness is suggested by the reported applicability of results.

Advantages of the approach
The approach evaluates the means of generating the browsing structure, not merely the browsing structure itself. It finds the nature of weaknesses of the method of generating the structure, whether the method is automatic (an algorithm), manual, or both. The evaluation need not cover all parts of a system to pinpoint important problems. The feedback provided is useful for improving the system under consideration.

Importantly, the scheme has been established as reliable by the test case.

The study method requires minimal resources, time, and effort, requiring only a judicious selection of test categories (done in the manner of task selection for usability studies), a simple survey set-up, and basic
statistical analyses. This type of study is easily done remotely and could be modified to be more task-based while retaining its desirable features.

Future work
Iterative evaluation should be conducted during development of a system to establish the point of diminishing returns. The development of anchored scales for use in this approach would allow for use of further statistical analyses. Research into automatic category creation in general stands to benefit from the creation of gold standard datasets, which might be accomplished through a large-scale card sort study.

Acknowledgements
I want to thank Emilia Stoica, Lisa J. Elliott, Sara Gilliam and Marti Hearst. This material is based in part on work supported by the NSF CISE Directorate.

References
[1] Amigó, E., Giménez, J., Gonzalo, J., and Màrquez, L. MT evaluation: Human-like vs. human acceptable. In Proc. COLING/ACL 2006, ACL (2006), 17-24.
[2] Borlund, P. and Ingwersen, P. The development of a method for the evaluation of interactive information retrieval systems. Journal of Documentation 53, 3 (1997), 225-250.
[3] Craggs, R. and Wood, M. Evaluating discourse and dialogue coding schemes. Computational Linguistics 31, 3 (2005), 289-296.
[4] English, J., Hearst, M., Sinha, R., Swearingen, K., and Yee, K. Hierarchical faceted metadata in site search interfaces. Communications of the ACM 45, 9 (2002), 628-639.
[5] Krowne, A. and Halbert, M. An initial evaluation of automated organization for digital library browsing. In Proc. JCDL 2005, ACM Press (2005), 246-255.
[6] Kummamuru, K., Lotlikar, R., Roy, S., Singal, K., and Krishnapuram, R. A hierarchical monothetic document clustering algorithm for summarization and browsing search results. In Proc. of the 13th International Conference on World Wide Web, ACM Press (2004), 658-665.
[7] Lawrie, D. and Croft, W.B. Discovering and comparing topic hierarchies. In Proc. RIAO 2000, Elsevier (2000), 314-330.
[8] Lawrie, D. and Croft, W.B. Generating hierarchical summaries for web searches. In Proc. SIGIR 2003, ACM Press (2003), 457-458.
[9] Li, T., Zhu, S., and Ogihara, M. Topic hierarchy generation via linear discriminant projection. In Proc. SIGIR 2003, ACM Press (2003), 421-422.
[10] Nanas, N., Uren, V., and de Roeck, A. Building and applying a concept hierarchy representation of a user profile. In Proc. SIGIR 2003, ACM Press (2003), 198-204.
[11] Pirolli, P., Schank, P., Hearst, M., and Diehl, C. Scatter/Gather browsing communicates the topic structure of a very large text collection. In Proc. SIGCHI 1996, ACM Press (1996), 213-220.
[12] Sanderson, M. and Croft, B. Deriving concept hierarchies from text. In Proc. SIGIR 1999, ACM Press (1999), 206-213.
[13] Stoica, E. and Hearst, M. Nearly-automated hierarchy creation. In Proc. HLT/NAACL 2004, ACL (2004), 117-120.
[14] Sun, A. and Lim, E.-P. Hierarchical text classification and evaluation. In Proc. of the 1st IEEE International Conference on Data Mining, IEEE Computer Society (2001), 521-528.
[15] Wu, Y.B., Shankar, L., and Chen, X. Finding more useful information faster from web search results. In Proc. CIKM 2003, ACM Press (2003).