An Evaluation Scheme for Hierarchical Information Browsing Structures

Megan Richardson
New Mexico State University
P.O. Box 30001
Las Cruces, NM 88003-8001
merichar@nmsu.edu

Abstract
There is no widely accepted means of evaluating category systems for information search and browsing. This presentation outlines an evaluation scheme and an evaluation method that applies the scheme. The scheme delineates features broadly classified under comprehensiveness, coherence, and correctness. The method evaluates the category system through a survey distributed among subject domain experts. The method requires minimal resources, is easily conducted remotely, and is easily modified. The approach finds the over- and under-sensitivities of the method of generating the system. A case study has demonstrated the usefulness of the approach, and the inter-rater reliability found suggests that the evaluation scheme is meaningful.

Keywords
Evaluation, search, browsing, navigation, category systems, hierarchies, information architecture, heuristics, usability, design, human factors

ACM Classification Keywords
H.5.2. Information interfaces and presentation: User interfaces – Evaluation/methodology, User-centered design. H.3.3. Information storage and retrieval: Information search and retrieval – Search process.

Copyright is held by the author/owner(s).
CHI 2008, April 5 – April 10, 2008, Florence, Italy.
ACM 978-1-60558-012-8/08/04.
Introduction
This presentation describes an evaluation scheme (or schema) for hierarchical information search and browsing systems, a study method that applies the scheme, and a case study that establishes the reliability of the scheme.

Because of the large volume of digital information available, designers of information browsing systems turn to automation where possible; yet there is no standard method for evaluating the systems that provide this automation. Evaluations can guide development and demonstrate the value of a system to stakeholders. The technique described here allows for quick, inexpensive evaluations retaining the best features of existing evaluation methods.

Background: Category system evaluation
There is no widely accepted means of evaluating category systems for search and browsing. Several metrics have been applied, but each has drawbacks.

Since category system creation has some historical roots in the field of information retrieval (IR), traditional IR evaluation metrics have been applied [e.g., 2]. These include precision (the proportion of relevant items retrieved to the total number of items retrieved), recall (the proportion of relevant items retrieved to the total number of relevant items in the collection), and combinations thereof such as F-measure (the harmonic mean of precision and recall). Such automatic measurements are highly practical in that they easily can be retaken with each iteration of development of the system. Unfortunately, most such measurements have not been definitively correlated with other desirable features of a system, such as usability or preferability, and they often do not provide diagnostic information for use in system development. Furthermore, most of the metrics in this class apply better to clusters than to hierarchies, having been developed for use with document retrieval and similar applications where output can meaningfully be regarded as unstructured bags of words.

Iterative task-oriented usability testing can assess the usefulness of a category system for completion of search tasks. These methods, however, can be costly and time-consuming.

IR metrics are intrinsic metrics taken using automatic methods. Usability criteria are extrinsic metrics taken using empirical methods. Table 1 classifies some category system evaluation metrics in terms of whether the methods are automatic (ad hoc, implemented in software) or empirical (post hoc, involving surveying or testing with representative human participants), and intrinsic (expressing purportedly desirable properties of the system) or extrinsic (assessing the usefulness of the category system in completion of tasks).

A system will have good features and bad features, and it is usually meaningless to average scores across features. The key to evaluation is provision of specific feedback. The method presented here does provide specific feedback, and this feedback has been shown to be useful, presumably because it derives from a meaningful analysis of the relevant features of an information browsing system. The method combines the ease of automatic methods and the validity of usability testing.
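The traditional IR metrics above can be sketched as a minimal computation over a single category. This is illustrative only (the item names are invented) and is not the evaluation method proposed here:

```python
# Illustrative computation of precision, recall, and F-measure for one
# category: "relevant" is the set of items that truly belong in the
# category, "retrieved" is the set the system placed there.

def precision_recall_f(relevant: set, retrieved: set) -> tuple:
    """Return (precision, recall, F-measure) for one category."""
    hits = len(relevant & retrieved)                 # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)     # harmonic mean
    return precision, recall, f_measure

relevant = {"pasta", "risotto", "soup", "stew"}      # invented example items
retrieved = {"pasta", "risotto", "salad", "stew"}
p, r, f = precision_recall_f(relevant, retrieved)
print(p, r, f)  # 0.75 0.75 0.75
```

As the background section notes, such scores are cheap to recompute per iteration but say nothing about the usability of the resulting hierarchy.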
Intrinsic Metrics, Automatic Methods (Comparison to Ideal System):
• precision and recall
• average uninterpolated precision [10]
• similarity as ratio of pairs common to two hierarchies to total pairs in one hierarchy [7]
• similarity as category distance [14]
• path length [8]

Intrinsic Metrics, Empirical Methods (User Evaluation; includes the present study):
• percent correct relationship between pairs [12]
• understandability [11]
• accuracy [5, 9]
• preferences [4, 6, 12, 15]

Extrinsic Metrics, Empirical Methods (Usability Testing):
• task completion rate
• time to completion

Table 1. Category system evaluation metrics.



The evaluation scheme: comprehensiveness, coherence and correctness
The scheme includes evaluators’ overall impressions of the extent to which the system captures the important concepts for the collection. This information is collected in the study method described through direct solicitation of participant opinion. The scheme also takes account of specific features of the system: its completeness, correct naming (or labeling) of categories, appropriate differentiation of categories, correct depth and scope of categories, and the coherence and correct placement of subcategories. These features are assessed in the study method by asking participants about specific changes they would make at particular points in the structure. Since responses are categorical, the structure’s rating on each feature consists in the rating given by the most participants. Results indicate over- and under-sensitivities of the method used to produce the structure. Table 2 presents the scheme.

A study method
The study technique presented here was used in the test case and was found to be effective. It is important to note, however, that it is but one possible application of the scheme. Traditional heuristic evaluations, usability studies, and card sort evaluations could be based on the scheme. Where gold standard data are available, automatic methods could be based on the scheme.

The study is conducted via an online survey. Participants are shown the category system and asked about the top-level categories. Then participants are asked to look at the second-level categories under each of two headings, and at a pre-selected sample of lower-level categories, and asked about those. Examples of the questions asked are provided in Table 2. (A complete list of questions is available on request.) Participants report subjective ratings on a four-point scale. Specific information is also recorded on changes the participants would make to the category structure.
Comprehensiveness
• Capture of important concepts: [at top level] [Do] these categories capture the important concepts for this collection[?]
• Completeness: [at top level] Would you add any top-level categories? [for subcategories] Would you add any subcategories to this category?

Coherence
• Category coherence: Are there any categories you would split into two, three, or more categories?
• Differentiation: Are there any categories you would merge together?
• Depth: [for subcategories] Would you promote any subcategories to top-level?
• Placement: [for subcategories] Would you move any subcategories to a different existing top-level category?

Correctness
• Correct naming: [at top level] Are there any categories you would keep, but rename?
• Scope (exclusion of misfit concepts): [at top level] Would you remove any top-level categories? [for subcategories] Would you remove any subcategories entirely?

Table 2. The evaluation scheme: general features, specific features, and sample questions.
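The scoring rule described above, in which a feature's rating is the response given by the most participants, can be sketched in a few lines. The feature names come from the scheme; the answer strings are invented for illustration and are not the survey's actual wording:

```python
# Minimal sketch of the modal-response scoring rule: each feature is
# rated by the answer given by the most participants. Answer strings
# below are illustrative placeholders, not the real survey options.
from collections import Counter

responses = {
    "Completeness":   ["no change", "no change", "add category", "no change"],
    "Correct Naming": ["rename", "rename", "no change", "rename"],
}

ratings = {feature: Counter(answers).most_common(1)[0][0]
           for feature, answers in responses.items()}
print(ratings)  # {'Completeness': 'no change', 'Correct Naming': 'rename'}
```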

Modes of the scaled responses are found. The comprehensiveness, coherence, and correctness of categories are assessed primarily by looking at the extent to which participants would add, remove, move, or rename categories.

Inter-rater agreement is calculated using Kendall’s coefficient of concordance (W). Scales are treated as categorical responses. Significance of difference from a null hypothesis of maximal rating can be determined using chi-square tests. Further between-subjects analyses are possible; those conducted in the test case, however, produced no informative results. Comparative evaluations can be conducted by finding any significant differences in responses to two systems.

Test case and scheme reliability
Castanet
Facets are orthogonal descriptors used to categorize items in a collection. If a facet is hierarchical rather than flat, a label may have other labels beneath it in the structure. The creation of a faceted hierarchical search and browsing system requires that metadata be assigned to items in the collection, specifying their facets. Because such browsing systems are specifically intended for use with large collections, hand-coding of this metadata is costly; hence the need for algorithms that automatically assign metadata. Castanet is an algorithm that assigns facet and hierarchy metadata [13].

The study
Castanet was evaluated using the present scheme and method. Two datasets were used: a list of 3275 journal titles taken from MEDLINE citations, and a collection of 13000 recipes found on web sites. These collections were previously used in Castanet pilots and were found to have fewer ambiguities than other collections [13].

The questionnaire was first tested with two users, an expert in each subject domain. Critiques were solicited of the study materials, especially concerning the clarity and number of questions and answer choices.
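The agreement analysis described earlier, Kendall’s coefficient of concordance with its standard chi-square approximation, can be sketched as follows. This is a textbook formulation under simplifying assumptions (rankings rather than raw categorical responses, no tie correction), and the rankings below are invented for illustration:

```python
# Sketch of Kendall's coefficient of concordance (W) for m raters
# ranking the same n items, with the usual chi-square approximation
# (df = n - 1). No tie correction is applied; data are invented.

def kendalls_w(rankings):
    """rankings: list of m lists, each a ranking (1..n) of the same n items."""
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
    mean = sum(rank_sums) / n
    s = sum((rs - mean) ** 2 for rs in rank_sums)   # squared deviations
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    chi2 = m * (n - 1) * w                          # compare to chi-square, df = n - 1
    return w, chi2

# Three raters rank five features; perfect agreement gives W = 1.0.
rankings = [[1, 2, 3, 4, 5],
            [1, 2, 3, 4, 5],
            [1, 2, 3, 4, 5]]
w, chi2 = kendalls_w(rankings)
print(w)  # 1.0
```

W ranges from 0 (no agreement) to 1 (perfect agreement); the significance test compares the chi-square statistic against the critical value for n - 1 degrees of freedom.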
Since in studies of search it is important to use participants with interest in the subject matter [2], participation was solicited on email discussion lists dedicated to the relevant subjects. In the evaluation under discussion, there were 49 participants. Participants were entered in a drawing for a gift certificate for Amazon.com.

Results and discussion
Results are summarized in Table 3. Mode responses indicate that participants would not make changes. For one database, however, Castanet did poorly in naming and created too many top-level categories. These results led Castanet’s creator to focus on the algorithm’s handling of synonymy [Stoica, personal communication].

                        Best rating        Intermediate ratings       Worst rating
                        (e.g., Strongly    (e.g., Agree somewhat;     (e.g., Strongly
                        agree)             Disagree somewhat)         disagree)
Overall Impressions
  Important Concepts      6%               *69% / 18%                   7%
  Intuitiveness          23%                26% / *38%                 14%
Specific Changes
  Completeness          *46%                32% / 15%                   8%
  Category Coherence    *81%                14% /  3%                   2%
  Differentiation       *49%                29% / 20%                   2%
  Correct Depth         *74%                22% /  4%                   0%
  Correct Placement     *63%                26% /  9%                   3%
  Correct Naming        *45%                29% / 22%                   5%
  Scope                 *54%                31% /  9%                   6%

Table 3. Summary of ratings for Castanet, averaged over biomedical and recipe conditions, rounded to nearest percent. The percentage of participants giving the mode response is marked with an asterisk.

Despite otherwise good ratings, ratings of Castanet’s intuitiveness were thinly distributed across the scale. Since usefulness and acceptability do not require that a system match human intuition [1], measures thereof were removed from the scheme based on this finding.

Reliability of the evaluation scheme
Reliability of a coding scheme is based on agreement between annotators [3]. A high (but sub-maximal) inter-rater agreement suggests that the scheme is neither idiosyncratic nor trivial and that raters have some shared understanding of the categories in the scheme. In the test case, inter-rater agreement was significant at W = 0.4309 (χ²(4, N = 49) = 63.7695, p < 0.0001). This inter-rater reliability suggests that the chosen evaluation scheme is meaningful.

Furthermore, the scheme’s usefulness is suggested by the reported applicability of results.

Advantages of the approach
The approach evaluates the means of generating the browsing structure, not merely the browsing structure itself. It finds the nature of weaknesses of the method of generating the structure, whether the method is automatic (an algorithm), manual, or both. The evaluation need not cover all parts of a system to pinpoint important problems. The feedback provided is useful for improving the system under consideration.

Importantly, the scheme has been established as reliable by the test case.

The study method requires minimal resources, time, and effort, requiring only a judicious selection of test categories (done in the manner of task selection for usability studies), a simple survey set-up, and basic statistical analyses. This type of study is easily done remotely and could be modified to be more task-based while retaining its desirable features.

Future work
Iterative evaluation should be conducted during development of a system to establish the point of diminishing returns. The development of anchored scales for use in this approach would allow for use of further statistical analyses. Research into automatic category creation in general stands to benefit from the creation of gold standard datasets, which might be accomplished through a large-scale card sort study.

Acknowledgements
I want to thank Emilia Stoica, Lisa J. Elliott, Sara Gilliam and Marti Hearst. This material is based in part on work supported by the NSF CISE Directorate.

References
[1] Amigó, E., Giménez, J., Gonzalo, J., and Màrquez, L. MT evaluation: Human-like vs. human acceptable. In Proc. COLING/ACL 2006, ACL (2006), 17-24.
[2] Borlund, P. and Ingwersen, P. The development of a method for the evaluation of interactive information retrieval systems. Journal of Documentation 53, 3 (1997), 225-250.
[3] Craggs, R. and Wood, M. Evaluating discourse and dialogue coding schemes. Computational Linguistics 31, 3 (2005), 289-296.
[4] English, J., Hearst, M., Sinha, R., Swearingen, K., and Yee, K. Hierarchical faceted metadata in site search interfaces. Communications of the ACM 45, 9 (2002), 628-639.
[5] Krowne, A. and Halbert, M. An initial evaluation of automated organization for digital library browsing. In Proc. JCDL 2005, ACM Press (2005), 246-255.
[6] Kummamuru, K., Lotlikar, R., Roy, S., Singal, K., and Krishnapuram, R. A hierarchical monothetic document clustering algorithm for summarization and browsing search results. In Proc. of the 13th International Conference on World Wide Web, ACM Press (2004), 658-665.
[7] Lawrie, D. and Croft, W.B. Discovering and comparing topic hierarchies. In Proc. RIAO 2000, Elsevier (2000), 314-330.
[8] Lawrie, D. and Croft, W.B. Generating hierarchical summaries for web searches. In Proc. SIGIR 2003, ACM Press (2003), 457-458.
[9] Li, T., Zhu, S., and Ogihara, M. Topic hierarchy generation via linear discriminant projection. In Proc. SIGIR 2003, ACM Press (2003), 421-422.
[10] Nanas, N., Uren, V., and de Roeck, A. Building and applying a concept hierarchy representation of a user profile. In Proc. SIGIR 2003, ACM Press (2003), 198-204.
[11] Pirolli, P., Schank, P., Hearst, M., and Diehl, C. Scatter/Gather browsing communicates the topic structure of a very large text collection. In Proc. SIGCHI 1996, ACM Press (1996), 213-220.
[12] Sanderson, M. and Croft, B. Deriving concept hierarchies from text. In Proc. SIGIR 1999, ACM Press (1999), 206-213.
[13] Stoica, E. and Hearst, M. Nearly-automated hierarchy creation. In Proc. HLT/NAACL 2004, ACL (2004), 117-120.
[14] Sun, A. and Lim, E.-P. Hierarchical text classification and evaluation. In Proc. of the 1st IEEE International Conference on Data Mining, IEEE Computer Society (2001), 521-528.
[15] Wu, Y.B., Shankar, L., and Chen, X. Finding more useful information faster from web search results. In Proc. CIKM 2003, ACM Press (2003).

Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Evaluation Scheme Hierarchical Browsing Structures

An Evaluation Scheme for Hierarchical Information Browsing Structures

Megan Richardson
New Mexico State University
P.O. Box 30001
Las Cruces, NM 88003-8001
merichar@nmsu.edu

Abstract

There is no widely accepted means of evaluating category systems for information search and browsing. This presentation outlines an evaluation scheme and an evaluation method that applies the scheme. The scheme delineates features broadly classified under comprehensiveness, coherence, and correctness. The method evaluates the category system through a survey distributed among subject domain experts. The method requires minimal resources, is easily conducted remotely, and is easily modified. The approach finds the over- and under-sensitivities of the method of generating the system. A case study has demonstrated the usefulness of the approach, and the inter-rater reliability found suggests that the evaluation scheme is meaningful.

Keywords

Evaluation, search, browsing, navigation, category systems, hierarchies, information architecture, heuristics, usability, design, human factors

ACM Classification Keywords

H.5.2. Information interfaces and presentation: User interfaces – Evaluation/methodology, User-centered design. H.3.3. Information storage and retrieval: Information search and retrieval – Search process.

Copyright is held by the author/owner(s). CHI 2008, April 5 – April 10, 2008, Florence, Italy. ACM 978-1-60558-012-8/08/04.
Introduction

This presentation describes an evaluation scheme (or schema) for hierarchical information search and browsing systems, a study method that applies the scheme, and a case study that establishes the reliability of the scheme.

Because of the large volume of digital information available, designers of information browsing systems turn to automation where possible; yet there is no standard method for evaluating the systems that provide this automation. Evaluations can guide development and demonstrate the value of a system to stakeholders. The technique described here allows for quick, inexpensive evaluations retaining the best features of existing evaluation methods.

Background: Category system evaluation

There is no widely accepted means of evaluating category systems for search and browsing. Several metrics have been applied, but each has drawbacks.

Since category system creation has some historical roots in the field of information retrieval (IR), traditional IR evaluation metrics have been applied [e.g., 2]. These include precision (the proportion of relevant items retrieved to the total number of items retrieved), recall (the proportion of relevant items retrieved to the total number of relevant items in the collection), and combinations thereof such as the F-measure (the harmonic mean of precision and recall). Such automatic measurements are highly practical in that they easily can be retaken with each iteration of development of the system. Unfortunately, most such measurements have not been definitively correlated with other desirable features of a system, such as usability or preferability, and they often do not provide diagnostic information for use in system development. Furthermore, most of the metrics in this class apply better to clusters than to hierarchies, having been developed for use with document retrieval and similar applications where output can meaningfully be regarded as unstructured bags of words.

Iterative task-oriented usability testing can assess the usefulness of a category system for completion of search tasks. These methods, however, can be costly and time-consuming.

IR metrics are intrinsic metrics taken using automatic methods. Usability criteria are extrinsic metrics taken using empirical methods. Table 1 classifies some category system evaluation metrics in terms of whether the methods are automatic (ad hoc, implemented in software) or empirical (post hoc, involving surveying or testing with representative human participants), and intrinsic (expressing purportedly desirable properties of the system) or extrinsic (assessing the usefulness of the category system in completion of tasks).

A system will have good features and bad features, and it is usually meaningless to average scores across features. The key to evaluation is provision of specific feedback. The method presented here does provide specific feedback, and this feedback has been shown to be useful, presumably because it derives from a meaningful analysis of the relevant features of an information browsing system. The method combines the ease of automatic methods and the validity of usability testing.
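As a concrete illustration of the IR metrics discussed above, the short sketch below computes precision, recall, and F-measure for a single category. The item sets are invented for the example and do not come from the study.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F-measure) for two sets of item ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F-measure: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A category was assigned 4 items, 3 of which truly belong; the
# collection holds 6 items that belong (invented numbers).
p, r, f = precision_recall_f1({1, 2, 3, 4}, {1, 2, 3, 5, 6, 7})
print(p, r, f)  # 0.75 0.5 0.6
```

As the text notes, scores like these are cheap to recompute after every iteration of system development, which is exactly what makes them attractive despite their weak correlation with usability.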
Table 1. Category system evaluation metrics.
• Intrinsic metrics, automatic methods (comparison to an ideal system): precision and recall; average uninterpolated precision [10]; similarity as ratio of pairs common to two hierarchies to total pairs in one hierarchy [7]; similarity as category distance [14]; path length [8].
• Intrinsic metrics, empirical methods (user evaluation; includes the present study): percent correct relationship between pairs [12]; understandability [11]; accuracy [5, 9]; preferences [4, 6, 12, 15].
• Extrinsic metrics, empirical methods (usability testing): task completion rate; time to completion.

The evaluation scheme: comprehensiveness, coherence and correctness

The scheme includes evaluators' overall impressions of the extent to which the system captures the important concepts for the collection. This information is collected in the study method described through direct solicitation of participant opinion. The scheme also takes account of specific features of the system: its completeness, correct naming (or labeling) of categories, appropriate differentiation of categories, correct depth and scope of categories, and the coherence and correct placement of subcategories. These features are assessed in the study method by asking participants about specific changes they would make at particular points in the structure. Since responses are categorical, the structure's rating on each feature consists in the rating given by the most participants. Results indicate over- and under-sensitivities of the method used to produce the structure. Table 2 presents the scheme.

A study method

The study technique presented here was used in the test case and was found to be effective. It is important to note, however, that it is but one possible application of the scheme. Traditional heuristic evaluations, usability studies, and card sort evaluations could be based on the scheme. Where gold standard data are available, automatic methods could be based on the scheme.

The study is conducted via an online survey. Participants are shown the category system and asked about the top-level categories. Then participants are asked to look at the second-level categories under each of two headings, and at a pre-selected sample of lower-level categories, and asked about those. Examples of the questions asked are provided in Table 2. (A complete list of questions is available on request.) Participants report subjective ratings on a four-point scale. Specific information is also recorded on changes the participants would make to the category structure.
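Scoring each feature by "the rating given by the most participants" amounts to taking the mode of the categorical responses. A minimal sketch of that tally, with hypothetical feature names and answer labels rather than the study's actual survey wording:

```python
from collections import Counter

# Hypothetical responses, one list of categorical answers per feature
# (labels are illustrative, not the study's actual answer options).
responses = {
    "completeness":   ["no change", "no change", "add", "no change", "add"],
    "correct naming": ["rename", "rename", "no change", "rename", "rename"],
}

def modal_rating(answers):
    """The most frequent categorical answer and its share of participants."""
    (label, count), = Counter(answers).most_common(1)
    return label, count / len(answers)

for feature, answers in responses.items():
    label, share = modal_rating(answers)
    print(f"{feature}: {label} ({share:.0%} of participants)")
```

Reporting the share alongside the mode mirrors the paper's Table 3, where the percentage of participants giving the mode response is highlighted for each feature.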
Table 2. The evaluation scheme (general feature, specific feature, sample questions).
• Comprehensiveness
  – Capture of important concepts: [at top level] Do these categories capture the important concepts for this collection?
  – Completeness: [at top level] Would you add any top-level categories? [for subcategories] Would you add any subcategories to this category?
• Coherence
  – Category coherence: Are there any categories you would split into two, three, or more categories?
  – Differentiation: Are there any categories you would merge together?
  – Depth: [for subcategories] Would you promote any subcategories to top-level?
  – Placement: [for subcategories] Would you move any subcategories to a different existing top-level category?
• Correctness
  – Correct naming: [at top level] Are there any categories you would keep, but rename?
  – Scope (exclusion of misfit concepts): [at top level] Would you remove any top-level categories? [for subcategories] Would you remove any subcategories entirely?

Modes of the scaled responses are found. The comprehensiveness, coherence, and correctness of categories are assessed primarily by looking at the extent to which participants would add, remove, move, or rename categories.

Inter-rater agreement is calculated using Kendall's coefficient of concordance (W). Scales are treated as categorical responses. Significance of difference from a null hypothesis of maximal rating can be determined using chi-square tests. Further between-subjects analyses are possible, but these were conducted in the test case and produced no informative results. Comparative evaluations can be conducted by finding any significant differences in responses to two systems.

Test case and scheme reliability

Castanet

Facets are orthogonal descriptors used to categorize items in a collection. If a facet is hierarchical rather than flat, a label may have other labels beneath it in the structure. The creation of a faceted hierarchical search and browsing system requires that metadata be assigned to items in the collection, specifying their facets. Because such browsing systems are specifically intended for use with large collections, hand-coding of this metadata is costly; hence the need for algorithms that automatically assign metadata. Castanet is an algorithm that assigns facet and hierarchy metadata [13].

The study

Castanet was evaluated using the present scheme and method. Two datasets were used: a list of 3275 journal titles taken from MEDLINE citations, and a collection of 13000 recipes found on web sites. These collections were previously used in Castanet pilots and were found to have fewer ambiguities than other collections [13].

The questionnaire was first tested with two users, an expert in each subject domain. Critiques were solicited of the study materials, especially concerning the clarity and number of questions and answer choices.
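The agreement statistic used above, Kendall's coefficient of concordance, can be sketched as follows. This is a generic illustration without the tie correction that a real analysis of tied ratings would need, and the rankings are invented, not the study's data.

```python
def kendalls_w(ranks):
    """Kendall's coefficient of concordance W (no tie correction).

    `ranks` holds one ranking per rater; each ranking assigns the
    ranks 1..n to the same n items.
    """
    m, n = len(ranks), len(ranks[0])              # m raters, n items
    totals = [sum(r[j] for r in ranks) for j in range(n)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)      # spread of the rank sums
    w = 12 * s / (m ** 2 * (n ** 3 - n))          # 0 = no agreement, 1 = perfect
    chi2 = m * (n - 1) * w                        # approx. chi-square, n-1 df
    return w, chi2

# Three raters rank five items (invented data).
w, chi2 = kendalls_w([[1, 2, 3, 4, 5],
                      [2, 1, 3, 4, 5],
                      [1, 3, 2, 4, 5]])
print(round(w, 3), round(chi2, 3))  # 0.889 10.667
```

The chi-square approximation of m(n−1)W is what allows a significance test of the agreement, as in the reliability result reported below.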
Since in studies of search it is important to use participants with interest in the subject matter [2], participation was solicited on email discussion lists dedicated to the relevant subjects. In the evaluation under discussion, there were 49 participants. Participants were entered in a drawing for a gift certificate for Amazon.com.

Results and discussion

Results are summarized in Table 3. Mode responses indicate that participants would not make changes. For one database, however, Castanet did poorly in naming and created too many top-level categories. These results led Castanet's creator to focus on the algorithm's handling of synonymy [Stoica, personal communication].

Table 3. Summary of ratings for Castanet, averaged over biomedical and recipe conditions, rounded to nearest percent. The percentage of participants giving the mode response is marked with an asterisk.
Columns: Best Rating (e.g., Strongly agree); two Intermediate Ratings (e.g., Agree somewhat, Disagree somewhat); Worst Rating (e.g., Strongly disagree).

Overall Impressions
  Important Concepts    6%   18%    7%   69%*
  Intuitiveness        23%   26%   14%   38%*
Specific Changes
  Completeness         32%   15%    8%   46%*
  Category Coherence   14%    3%    2%   81%*
  Differentiation      29%   20%    2%   49%*
  Correct Depth        22%    4%    0%   74%*
  Correct Placement    26%    9%    3%   63%*
  Correct Naming       29%   22%    5%   45%*
  Scope                31%    9%    6%   54%*

Despite otherwise good ratings, ratings of Castanet's intuitiveness were thinly distributed across the scale. Since usefulness and acceptability do not require that a system match human intuition [1], measures thereof were removed from the scheme based on this finding.

Reliability of the evaluation scheme

Reliability of a coding scheme is based on agreement between annotators [3]. A high (but sub-maximal) inter-rater agreement suggests that the scheme is neither idiosyncratic nor trivial and that raters have some shared understanding of the categories in the scheme. In the test case, inter-rater agreement was significant at W = 0.4309 (χ²(4, N = 49) = 63.7695, p < 0.0001). This inter-rater reliability suggests that the chosen evaluation scheme is meaningful.

Furthermore, the scheme's usefulness is suggested by the reported applicability of results.

Advantages of the approach

The approach evaluates the means of generating the browsing structure, not merely the browsing structure itself. It finds the nature of weaknesses of the method of generating the structure, whether the method is automatic (an algorithm), manual, or both. The evaluation need not cover all parts of a system to pinpoint important problems. The feedback provided is useful for improving the system under consideration.

Importantly, the scheme has been established as reliable by the test case.

The study method requires minimal resources, time, and effort, requiring only a judicious selection of test categories (done in the manner of task selection for usability studies), a simple survey set-up, and basic statistical analyses. This type of study is easily done remotely and could be modified to be more task-based while retaining its desirable features.

Future work

Iterative evaluation should be conducted during development of a system to establish the point of diminishing returns. The development of anchored scales for use in this approach would allow for use of further statistical analyses. Research into automatic category creation in general stands to benefit from the creation of gold standard datasets, which might be accomplished through a large-scale card sort study.

Acknowledgements

I want to thank Emilia Stoica, Lisa J. Elliott, Sara Gilliam and Marti Hearst. This material is based in part on work supported by the NSF CISE Directorate.

References

[1] Amigó, E., Giménez, J., Gonzalo, J., and Màrquez, L. MT evaluation: Human-like vs. human acceptable. In Proc. COLING/ACL 2006, ACL (2006), 17-24.
[2] Borlund, P. and Ingwersen, P. The development of a method for the evaluation of interactive information retrieval systems. Journal of Documentation 53, 3 (1997), 225-250.
[3] Craggs, R. and Wood, M. Evaluating discourse and dialogue coding schemes. Computational Linguistics 31, 3 (2005), 289-296.
[4] English, J., Hearst, M., Sinha, R., Swearingen, K., and Yee, K. Hierarchical faceted metadata in site search interfaces. Communications of the ACM 45, 9 (2002), 628-639.
[5] Krowne, A. and Halbert, M. An initial evaluation of automated organization for digital library browsing. In Proc. JCDL 2005, ACM Press (2005), 246-255.
[6] Kummamuru, K., Lotlikar, R., Roy, S., Singal, K., and Krishnapuram, R. A hierarchical monothetic document clustering algorithm for summarization and browsing search results. In Proc. of the 13th International Conference on World Wide Web, ACM Press (2004), 658-665.
[7] Lawrie, D. and Croft, W.B. Discovering and comparing topic hierarchies. In Proc. RIAO 2000, Elsevier (2000), 314-330.
[8] Lawrie, D. and Croft, W.B. Generating hierarchical summaries for web searches. In Proc. SIGIR 2003, ACM Press (2003), 457-458.
[9] Li, T., Zhu, S., and Ogihara, M. Topic hierarchy generation via linear discriminant projection. In Proc. SIGIR 2003, ACM Press (2003), 421-422.
[10] Nanas, N., Uren, V., and de Roeck, A. Building and applying a concept hierarchy representation of a user profile. In Proc. SIGIR 2003, ACM Press (2003), 198-204.
[11] Pirolli, P., Schank, P., Hearst, M., and Diehl, C. Scatter/Gather browsing communicates the topic structure of a very large text collection. In Proc. SIGCHI 1996, ACM Press (1996), 213-220.
[12] Sanderson, M. and Croft, B. Deriving concept hierarchies from text. In Proc. SIGIR 1999, ACM Press (1999), 206-213.
[13] Stoica, E. and Hearst, M. Nearly-automated metadata hierarchy creation. In Proc. HLT/NAACL 2004, ACL (2004), 117-120.
[14] Sun, A. and Lim, E.-P. Hierarchical text classification and evaluation. In Proc. of the 1st IEEE International Conference on Data Mining, IEEE Computer Society (2001), 521-528.
[15] Wu, Y.B., Shankar, L., and Chen, X. Finding more useful information faster from web search results. In Proc. CIKM 2003, ACM Press (2003).