Making Decisions in Creating Taxonomies

Heather Hedden
Information Taxonomist, Viziant Corporation

November 8, 2007

Copyright © 2007 Viziant Corporation. All Rights Reserved. Proprietary & Confidential.

Background

• Heather Hedden’s taxonomy development experience
– controlled vocabularies for periodical index databases (Gale)
– matching of controlled vocabulary to keywords for consumer
products/services directories (various “yellow pages” clients)
– enterprise taxonomies for corporate web sites and intranets (Earley
& Associates)
– base and custom taxonomies integrated within a knowledge
discovery and data mining product (Viziant)

• Viziant Corporation
– A provider of information access and intelligence systems for
enterprises and government


Decisions for the Taxonomist

• Decisions of the taxonomy owner
– Approximate number of top-level nodes and number of levels
– Structure: primarily facets or tree
– Interface design: number and layout of displayed nodes
– Presence of polyhierarchies
– Automated search & retrieval or human indexing/tagging

• Decisions often left to the taxonomist
– Exact/final number of levels, nodes per level
– Arrangement of the node hierarchy, placement within facets
– Degree of term pre- or post-coordination
– Extent of use of variants/cross-references


Number of levels, nodes per level

• 3 levels and 6-8 nodes per level is a nice ideal
– Web site/intranet menu navigation
• Menu is confined to bar across top or margin to the side
• Menus pull down or topic trees expand in place

• More levels and nodes per level are often needed
– Content management/document retrieval for large content
repositories
• industries, products, fields of science, diseases, geographies,
named entities

• Decision: Make more levels or make more nodes per level


Number of levels, nodes per level: Examples

Deep: Many levels

Geographies
- North America
-- United States
--- New England
---- Massachusetts
----- Boston
------ North End
------- Old North Church
- South America
- Europe
- Asia
-- Central Asia
-- Middle East
-- South Asia
-- Southeast Asia
- Africa
- Oceania

Broad: Many nodes per level

Geographies
- U.S. cities
-- Albuquerque
- U.S. States
-- Alabama
- Countries
- World cities
- Continents
- Landmarks


Number of levels, nodes per level: Examples

Deep: Many levels (SIC, NAICS style with 10-20 upper level nodes)

Industries
- Transportation services
-- Air transportation
--- Scheduled air transportation services
---- Scheduled air freight transportation services

Broad: Many nodes per level (job search sites, 50 - 80 nodes per level)

Industries (second levels at select nodes only: Healthcare, Sales)
- Accounting/Auditing
- Administrative Support Services
- Advertising/Marketing/Public Relations
- Aerospace/Aviation/Defense
- Agriculture, Forestry, & Fishing
- Airlines
etc.


Number of levels, nodes per level

• Decision Factors
– Display interface/horizontal and vertical real estate
– Speed of displaying deeper levels
– User market, needs, and expectations
• Industry experts, internal employees, general public, students,
etc.

• Need to balance how much can be easily skimmed in one
view vs. how many levels down the user has patience to
click down through
• More levels lead to less consistency across levels.
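The deep-versus-broad trade-off can be measured before committing to a structure. The sketch below is illustrative only: it assumes the taxonomy is held as nested dictionaries (a hypothetical representation) and uses the Geographies example, with `nodes_per_level` and `max_siblings` as made-up helper names. Depth approximates how far a user must click down; the largest sibling group approximates how much must be skimmed in one view.

```python
from collections import Counter

# Hypothetical representation: node label -> dict of child nodes.
geographies = {
    "North America": {
        "United States": {
            "New England": {
                "Massachusetts": {
                    "Boston": {"North End": {"Old North Church": {}}}
                }
            }
        }
    },
    "South America": {},
    "Europe": {},
    "Asia": {
        "Central Asia": {},
        "Middle East": {},
        "South Asia": {},
        "Southeast Asia": {},
    },
    "Africa": {},
    "Oceania": {},
}


def nodes_per_level(tree, level=1, counts=None):
    """Count the nodes at each level; the top level is level 1."""
    if counts is None:
        counts = Counter()
    for children in tree.values():
        counts[level] += 1
        nodes_per_level(children, level + 1, counts)
    return counts


def max_siblings(tree):
    """Largest group of sibling nodes a user would skim in one list."""
    widest = len(tree)
    for children in tree.values():
        widest = max(widest, max_siblings(children))
    return widest


counts = nodes_per_level(geographies)
depth = max(counts)  # number of levels in the deepest branch
print(depth, dict(counts), max_siblings(geographies))
```

Comparing these two figures for a draft tree against what the display interface can show is one way to decide whether to add a level or widen an existing one.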


Arrangement of node hierarchy

• Decision: What’s the best method to handle different
means of classification within the same hierarchy?
– Industries by traditional SIC/NAICS classification or by vertical
market
– Products by manufacturing technology or by end-use
– Places by physical geographic location or by type
– Organizations by goals/objectives or by political/religious affiliation
– Government agencies by type or by country/state of affiliation

• Even within facets, there often are hierarchies.
• Even allowing polyhierarchies, a top-level classification is
needed, and too many polyhierarchies can be confusing.
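A polyhierarchy can be pictured as a node with more than one broader term. The following is a minimal sketch, assuming a hypothetical `broader` map from each node to its list of parents; the node names are borrowed from the governmental bodies example and the placement shown is only one possible arrangement. Listing the nodes that have several parents helps judge whether polyhierarchies are becoming too numerous.

```python
# Hypothetical broader-term map: node -> list of parents ([] = top level).
broader = {
    "Governmental bodies & agencies": [],
    "U.S. governmental bodies & agencies": ["Governmental bodies & agencies"],
    "Legislative bodies": ["Governmental bodies & agencies"],
    "U.S. legislative branch": [
        "U.S. governmental bodies & agencies",
        "Legislative bodies",
    ],
}


def paths_to_top(node):
    """Return every path from a top-level node down to the given node."""
    parents = broader[node]
    if not parents:
        return [[node]]
    return [
        path + [node]
        for parent in parents
        for path in paths_to_top(parent)
    ]


def polyhierarchical_nodes():
    """Nodes that appear under more than one parent."""
    return sorted(n for n, parents in broader.items() if len(parents) > 1)


for path in paths_to_top("U.S. legislative branch"):
    print(" > ".join(path))
print(polyhierarchical_nodes())
```

Both paths still end at a single top-level node, which reflects the point that a top-level classification is needed even when polyhierarchies are allowed.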


Arrangement of node hierarchy: Examples
1. Governmental bodies & agencies
- U.S. governmental bodies & agencies
-- U.S. Courts
-- U.S. executive branch agencies
-- U.S. legislative branch
-- State bodies & agencies
- Foreign governmental bodies & agencies
-- Foreign courts
-- Foreign legislatures
-- Foreign national agencies
-- Foreign state & provincial government agencies

-- Foreign legislatures (+ instances)
-- U.S. legislatures (+ US federal and state instances)

- Legislative bodies
-- National legislatures (+ instances, both foreign and US)
-- State & provincial legislatures (+ all instances alphabetical for US and foreign)

- Legislative bodies (+ all instances, US and foreign, in one alphabetical list)



• Decision: If linking named entities to topical subjects, should they each
link at the lowest node level possible, or group all of them together at a
higher level?

• Example: Link specific churches at the broader term, Churches
(denominations), at the appropriate narrower term, or at both

Churches (denominations)
- Catholic churches
- Orthodox churches
- Protestant churches

Does the user know where to look for the Maronite Church or the
Assyrian Church of the East?



• Decision factors:
– Knowledge of users as to where to categorize an entity
– Likelihood of users to browse rather than search for entities
– Existence of entities that don’t belong in a subcategory
– Purpose to teach users (students) where entities belong

• Linking entities at both the specific and the broader level makes
them easier to find, but it clutters the taxonomy, slows
performance, and may not seem logical to the user at first
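The two linking options can be compared with a small sketch. Everything here is an assumption for illustration: the `broader` and `entity_links` maps, the `entities_at` helper, and the choice to file the Maronite Church under Catholic churches while leaving the Assyrian Church of the East at the broader term because it fits none of the three narrower nodes.

```python
# Hypothetical single-parent hierarchy: narrower node -> broader node.
broader = {
    "Catholic churches": "Churches (denominations)",
    "Orthodox churches": "Churches (denominations)",
    "Protestant churches": "Churches (denominations)",
}

# Hypothetical entity links: entity -> the most specific node it is linked to.
entity_links = {
    "Maronite Church": "Catholic churches",
    "Assyrian Church of the East": "Churches (denominations)",
}


def ancestors(node):
    """All broader nodes above the given node, nearest first."""
    chain = []
    while node in broader:
        node = broader[node]
        chain.append(node)
    return chain


def entities_at(node, link_at_both_levels):
    """Entities shown at a node, optionally including those linked lower down."""
    found = []
    for entity, linked in entity_links.items():
        if linked == node:
            found.append(entity)
        elif link_at_both_levels and node in ancestors(linked):
            found.append(entity)
    return sorted(found)


print(entities_at("Churches (denominations)", link_at_both_levels=False))
print(entities_at("Churches (denominations)", link_at_both_levels=True))
```

With linking at both levels, a user who does not know the denomination still finds the entity at the broader node, at the cost of a longer list there.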



– User market, needs, and expectations
• How the users classify the subject matter
• Whether a topic is even likely to be browsed for in the taxonomy
or rather entered in the search box
– Support for polyhierarchies
– Permissibility of nodes as category labels, not linked to content, at
various intermediate levels within the hierarchy
• e.g. Foreign legislatures

• Need to consider
– Whether to create nodes difficult to distinguish in indexing
• e.g. both Legislative bodies and National legislatures


Placement within facets

• Facets may be determined by taxonomy owner, but
placement of certain nodes may not be obvious
– Institutions could be Places or Organizations
• Places of worship, educational institutions, museums, libraries
– Business activities could be Actions or Topics
• Acquisitions, Contracts, Joint ventures, Sales

• Decisions:
– In which facet to put these nodes
– Whether two (parenthetically modified) nodes for the concept
should be created, one for each facet, e.g. Hotels (buildings) and
Hotels (companies)
– Or whether a node can be in more than one facet


Placement within facets

• Decision factors
– System support for two occurrences of the same-named node
– Automated or manual indexing
• Automated indexing may not distinguish between different facet-
meanings of a term: action or topic, place or organization, etc.
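Where the system supports two occurrences of the same-named node, one way to keep them apart is to identify each node by facet and label together and to show a parenthetical qualifier only when the label is ambiguous. This is a sketch under that assumption; the tuple layout and the `display_labels` helper are hypothetical, and the facet names follow the Places and Organizations example.

```python
from collections import defaultdict

# Hypothetical node records: (facet, label, qualifier).
nodes = [
    ("Places", "Hotels", "buildings"),
    ("Organizations", "Hotels", "companies"),
    ("Places", "Museums", "buildings"),
]


def display_labels(nodes):
    """Map (facet, label) to a display label, qualified only if ambiguous."""
    by_label = defaultdict(list)
    for facet, label, qualifier in nodes:
        by_label[label].append((facet, qualifier))

    labels = {}
    for label, occurrences in by_label.items():
        for facet, qualifier in occurrences:
            if len(occurrences) > 1:
                labels[(facet, label)] = f"{label} ({qualifier})"
            else:
                labels[(facet, label)] = label
    return labels


labels = display_labels(nodes)
for key, shown in sorted(labels.items()):
    print(key, "->", shown)
```

The qualifier helps a human indexer choose the right facet; as noted above, automated indexing may still not distinguish the two meanings from the text alone.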


Term pre-coordination or post-coordination

• Hierarchical trees or thesauri serve pre-coordination
– User browses for most specific concept

• Facets serve post-coordination
– User chooses combination of concepts from multiple facets (e.g.
place, product type, usage issue, customer type)

• But topic trees/thesauri may be used within a UI supporting
multiple search terms (narrow a search)
• But hierarchies can exist within facets

• Decisions:
– In a topic tree/thesaurus, whether to expect post-coordination
– In a faceted taxonomy, whether and how much to have pre-
coordination



• Place and Topic facets
– Russian foreign policy or Russia and Foreign policy
– French embassies or France and Embassies
– United States-Canadian relations
• Ethnicity and Occupation facets
– Hispanic writers or Hispanics and Writers
• Body part and Disease facets
– Ovarian cancer or Ovaries and Cancer
• Business action and Product facets
– Drug trials or Product testing and Drugs
– CRM Software or Customer Relationship Management and
Software/Marketing software
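The difference between the two approaches can be sketched as set operations over an index. The document ids and index contents below are invented for illustration, and `search_post` and `search_pre` are hypothetical helper names. Document 9 is assumed to mention ovaries and cancer in unrelated contexts, so it is tagged with both facet terms but not with the pre-coordinated term.

```python
# Hypothetical indexes: term -> set of document ids tagged with that term.
post_coordinated = {
    "Ovaries": {1, 2, 5, 9},
    "Cancer": {2, 3, 5, 8, 9},
    "Russia": {4, 6},
    "Foreign policy": {6, 7},
}
pre_coordinated = {
    "Ovarian cancer": {2, 5},
    "Russian foreign policy": {6},
}


def search_post(*terms):
    """Post-coordination: the user combines terms at search time."""
    if not terms:
        return set()
    return set.intersection(*(post_coordinated[t] for t in terms))


def search_pre(term):
    """Pre-coordination: the combined concept is a single node."""
    return pre_coordinated[term]


print(search_post("Ovaries", "Cancer"))   # includes document 9
print(search_pre("Ovarian cancer"))       # excludes document 9
print(search_post("Russia", "Foreign policy"))
```

In this made-up data the post-coordinated search returns one extra document, which is the kind of false combination a pre-coordinated node avoids when it is applied correctly by the indexer.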



– Human or automated indexing/tagging
• If human indexing, all could be post-coordinated
– Keyword searching or taxonomy browse
• If keyword searching, pre-coordination is preferred
– Nature and volume of content
• Specific content serves narrower pre-coordinated subjects
– Scope of the content
• Wide range of articles is better served by pre-coordination



• Advantages to pre-coordinated terms
– Provide more precise retrieval results, if used correctly
– Better suited for specific, custom taxonomies
– Better suited for phrase (search-string) searching
• Disadvantages to pre-coordinated terms
– Narrower nodes might be overlooked by the user.
– More complex to correctly index.

• Flexibility in degree of pre- or post-coordination is OK, but
consistency of application makes the taxonomy more
usable.


Variants and cross-references

• Variants, Nonpreferred terms, Nonpostable terms,
Equivalent terms, See references, Cross-references,
Keywords

• First, take into consideration:
– Automated stemming
– Taxonomy browse, search, or both. If both, which is dominant
– Content from divergent sources, countries
– System/UI support for a variant pointing to more than one node



• Decision: whether a concept should be a node or its variant
(when they are not synonyms)
– Create a more specific/narrower node, or use it as a variant
• Hydroelectric plants USE Electric power plants
• Factories USE Plants & factories
– Differentiate closely related terms, or use one as a variant
• Foreign policy vs. International relations
• Colleges & universities vs. Higher education
– Differentiate topics from actions, or use one as a variant
• Contracts vs. Contracting
• Investments vs. Investing
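Once the node-or-variant decision is made, USE references behave like a lookup from entry terms to preferred nodes. The sketch below assumes one possible outcome of the decisions above (for instance, Contracting treated as a variant of Contracts); the `use_references` table and `resolve` helper are hypothetical. The entry "plants" is an invented case of a variant pointing to more than one node, which works only if the system and UI support it.

```python
# Hypothetical preferred nodes, in display order.
preferred_nodes = [
    "Electric power plants",
    "Plants & factories",
    "Contracts",
    "Investments",
]

# Hypothetical USE references: lowercase variant -> preferred node(s).
use_references = {
    "hydroelectric plants": ["Electric power plants"],
    "factories": ["Plants & factories"],
    "contracting": ["Contracts"],
    "investing": ["Investments"],
    # Invented example of one variant pointing to more than one node.
    "plants": ["Electric power plants", "Plants & factories"],
}


def resolve(term):
    """Return the preferred node(s) for a term entered by a user or indexer."""
    key = term.strip().lower()
    for node in preferred_nodes:
        if node.lower() == key:
            return [node]
    return use_references.get(key, [])


print(resolve("Hydroelectric plants"))
print(resolve("plants"))
print(resolve("Power stations"))  # no node or variant: empty list
```

An empty result for a plausible user term is a signal that another variant, or another node, may be worth adding.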



• Decision: whether a term should be a node or its variant
(when synonyms)
– Plural vs. singular
– Acronym vs. spelled out form
– Technical/academic vs. popular term
– Synonyms also for a word within a phrase-term
• administration vs. management
• oil vs. petroleum
• communications vs. telecommunications
• health vs. medical



• Decision Factors: for the number of variants per node
– Users as monolithic or diverse
– Size of taxonomy (if browsable)
• If small and easily learned then large number of variants
unnecessary
• Automated indexing needs many more variants
– Keyword searching or taxonomy browse
• Keyword searching needs more variants
– Nature and volume of content
• Broad/general content needs more variants
– Display of Cross-references
• Limit the number of variants if they display in the UI



• Decision Factors: for the choice of term as node or variant
– User background, level of expertise, expectations
– Political correctness, instructiveness to users
– Number of characters in display width

• The more stakeholders involved, the more complex the
decision in choosing the preferred name of the node


Conclusions

• Taxonomy creation is a decision-making task
• Different decisions are based on different factors
• Each taxonomy project is unique
• Creators/editors of the taxonomy need to know:
– Who are the users and what are their needs
– What is the nature of the content
– What the user interface will look like
– What the system supports (faceted search, multiple cross-refs)
– How the content will be indexed/tagged


Questions?

Heather Hedden
Information Taxonomist
Viziant Corporation
Two International Place, Suite 410
Boston, MA 02110
www.viziantcorp.com

Heather.hedden@viziantcorp.com
617-517-0075 ext. 104
978-467-5195 (cell)

