DHUG 2017 - Thesaurus Construction Training

Learn the basics and general guidelines of thesaurus and taxonomy construction the way Access Innovations does it.

Published in: Software

  1. 1. General Guidelines for Thesaurus Construction
  2. 2. Where To Start
  3. 3. Before You Begin, Be Sure to Know: How will we use the thesaurus? What is the size and scope of the project? Who will be viewing and accessing the vocabulary? When will we update the thesaurus?
  4. 4. Approaches For Taxonomy Design: Top-down • Identifying top categories first, or utilizing a pre-established category list for your top areas • Organizing each top domain around the most relevant and broad coverage of documents • Dividing each category into subcategories to narrow down to granular topical areas • Establishing attributes and additional subcategories for each thesaurus node created
  5. 5. Approaches For Taxonomy Design: Bottom-up • Beginning from an unsorted list of vocabulary terms and concepts compiled from multiple resources • Moving terms in the list to classify their Broader / Narrower relationships • Declaring top domains after exploring the number of topics covered in a single term • Guaranteeing that every term is evaluated at least once • Sorting extremely large unsorted data sets efficiently
  6. 6. Top-down • Easier for smaller vocabulary sets • Quick method of identifying key top areas • Designed for a navigational mindset Bottom-up • More accurate representation of the content • Ideal for larger-scale thesauri • Content drives the entire structure of the thesaurus We recommend a mix of both, but every vocabulary demands a different course of action.
  7. 7. Resource Gathering
  8. 8. Resources for Designing a Thesaurus Existing Controlled Vocabularies • Additional taxonomies • Classification schemes • Topics and headings • Sitemaps • Glossaries and Definitions Listing of Keywords • Entered by an author or indexer • May range in size from 100 to 100,000 terms
  9. 9. Resources for Designing a Thesaurus • Search Logs • An unruly mess of words • What to look out for… • Which topics are more frequently searched for by users? • Has common terminology for concepts and technologies changed within the past x years? • Trim search logs to the most frequent and concise topics • Data Mining • N-gram tests • “Content-aware” vocabulary
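The n-gram tests mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the search log is available as a plain list of query strings; the sample queries and the helper names `ngrams` and `count_ngrams` are invented for the example and are not part of any particular tool.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(queries, n):
    """Count n-grams across a list of raw search queries."""
    counts = Counter()
    for query in queries:
        tokens = query.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical search-log sample, invented for illustration.
sample_log = [
    "solar cell efficiency",
    "thin film solar cell",
    "solar cell materials",
    "wind turbine design",
]

bigram_counts = count_ngrams(sample_log, 2)
print(bigram_counts.most_common(1))  # [('solar cell', 3)]
```

Frequent n-grams such as "solar cell" are candidate terms; they still need to be reviewed against the warrant criteria before entering the vocabulary.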
  10. 10. Defining the Thesaurus Specialization What Goes into the Thesaurus?
  11. 11. Selecting Thesaurus Terms • Look for descriptors: thesaurus terms that adequately reflect the content • Terms which describe fields of study, technology, applications, devices, research, and other content • Thesaurus terms must be concise, must express a single concept, and must be free of ambiguity • Concepts such as General and Applications will not describe what is written within a single document
  12. 12. Literary warrant • Justification for the representation of a concept in an indexing language or for the selection of a preferred term because of its frequent occurrence in the literature Organizational warrant • Justification for the representation of a concept in an indexing language or for the selection of a preferred term due to characteristics and context of the organization User Warrant • Justification for the representation of a concept in an indexing language or for the selection of a preferred term because of frequent requests for information on the concept or free-text searches on the term by users of an information storage and retrieval system.
  13. 13. Creating the Initial Build
  14. 14. Compiling the Terms Existing vocabularies • Be aware of overlap and multiple terminologies • Standardize the terms (plural, hyphenation, etc.) • Break up pre-coordination if it exists Whether to include the vocabulary's current hierarchy (if it contains one) is purely the decision of the thesaurus developer • Retaining the existing hierarchy will save time and effort while providing an early look at the structure of the vocabulary • However, conflicting and overlapping terms may cause problems when reviewing the initial build
  15. 15. Filtering the Unsorted Lists Standardize the “Word Salad” • Combining singular and plural forms of terms • Combining hyphenated terms • Removing named entities • Identifying and/or removing acronyms Add only the most frequently searched terms and added keywords • Can limit to the top 50 or 100 most frequent • Too many results can litter a vocabulary with rubbish terms
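The filtering steps above (combining hyphenated and singular/plural forms, then keeping only the most frequent terms) could look roughly like the following sketch. The plural folding is a deliberately naive assumption (it only handles a trailing "s"), and the function names and sample terms are hypothetical.

```python
from collections import Counter


def normalize(term):
    """Lowercase, turn hyphens into spaces, and collapse whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def merge_plural(term, known):
    """Fold a singular form into its plural if the plural is already in the list.

    Naive assumption: the plural is formed by appending 's'. Irregular
    plurals need a proper stemmer or manual review.
    """
    plural = term + "s"
    return plural if plural in known else term


def top_terms(raw_terms, limit=50):
    """Normalize the raw terms and return the `limit` most frequent ones."""
    normalized = [normalize(t) for t in raw_terms]
    known = set(normalized)
    folded = [merge_plural(t, known) for t in normalized]
    return Counter(folded).most_common(limit)


# Hypothetical unsorted keyword list, invented for illustration.
raw = ["Solar cells", "solar cell", "Solar-cells", "Thin-film", "thin film", "Wind turbines"]
print(top_terms(raw, limit=2))  # [('solar cells', 3), ('thin film', 2)]
```

Named entities and acronyms are left out of this sketch because identifying them reliably usually takes a curated list or human review.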
  16. 16. Next Step – Import!
  17. 17. Creation of the Initial Build • Establish primary categories for the thesaurus • Sort uncontrolled terms into appropriate categories • Most time-consuming process • Content will be re-evaluated, so don’t stress too much about getting it right the first time • Create synonyms and related terms as you sort each term • Double-check for conceptual duplicates within the project • Ensure standardized spelling (American vs. British English) • Check for typos • Review Literary, Organizational, and User Warrant for each term • Delete terms with little to no indexing value
  18. 18. Initial Build - Equivalence and Associations Six-Second Rule • As a rule of thumb, give yourself six seconds to brainstorm multiple ways to express a single concept. Creating synonyms not only makes for a stronger thesaurus, but can also identify duplicate concepts within the early vocabulary. Adding and searching for related terms will identify other subject areas included in the unsorted taxonomy.
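One way synonyms can surface duplicate concepts is to look for labels that more than one preferred term lays claim to. The sketch below assumes the early vocabulary is held as a plain dictionary mapping each preferred term to its list of synonyms; that structure, the function name, and the sample entries are all hypothetical.

```python
from collections import defaultdict


def find_shared_labels(entries):
    """Return labels (preferred terms or synonyms) claimed by more than one preferred term.

    `entries` maps each preferred term to a list of its synonyms.
    Matching is case-insensitive.
    """
    owners = defaultdict(set)
    for preferred, synonyms in entries.items():
        for label in [preferred, *synonyms]:
            owners[label.lower()].add(preferred)
    return {label: sorted(terms) for label, terms in owners.items() if len(terms) > 1}


# Hypothetical early vocabulary, invented for illustration.
entries = {
    "Automobiles": ["Cars", "Motor vehicles"],
    "Cars": ["Passenger cars"],
    "Bicycles": ["Bikes"],
}

print(find_shared_labels(entries))  # {'cars': ['Automobiles', 'Cars']}
```

Here "Cars" is both a preferred term and a synonym of "Automobiles", which suggests the two entries should be merged under one preferred expression.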
  19. 19. Evaluation of the Thesaurus Build
  20. 20. Evaluation • Review Literary, Organizational, and User Warrant • Division of top terms • Assign team members top levels to review • Fill in gaps in classification • Ensure no flat list of topics (more than 15 terms in a category) exists within a single section • Merge conceptual duplicates within the content • Prefer one expression over the others • Delete terms with little to no indexing value • Add synonyms not listed for each term • Add related terms which do not appear
  21. 21. Evaluation - Term style and Form Must represent a single train of thought • Removes ambiguity and uncertainty of concepts • Pre-coordination of terms should be avoided (“Acoustics in music”, “Cancer and metastasis”) Reduce slang and jargon for preferred terms unless no other word describes the concept or if the older terminology is infrequently used • (Microelectromechanical Systems and MEMS) • (Quantum bits and Qubits)
  22. 22. Evaluation - Term style and Form Use nouns or noun phrases / Avoid action verbs for concepts • Catalysis rather than catalyze • Distillation rather than distill • Reading rather than read Adjectives and Adverbs • May be used to differentiate concepts • Should not be used as individual terms
  23. 23. Evaluation Proper nouns (including names, places, etc.) should have proper capitalization (Arabian Peninsula, Milky Way Galaxy, Louvre, Albert Einstein) Compound terms • Used for disambiguation and for specificity • Granular descriptors ("Lead coating on copper pipes")
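The flat-list check can be automated once the hierarchy is in machine-readable form. A minimal sketch, assuming the hierarchy is a dictionary mapping each broader term to its direct narrower terms; the structure, function name, and sample data are hypothetical, and the threshold of 15 comes from the guideline above.

```python
def find_flat_lists(narrower, max_children=15):
    """Return broader terms whose direct narrower terms exceed `max_children`.

    `narrower` maps each broader term to a list of its direct narrower terms.
    The result maps each offending term to its number of direct children.
    """
    return {
        term: len(children)
        for term, children in narrower.items()
        if len(children) > max_children
    }


# Hypothetical hierarchy, invented for illustration.
hierarchy = {
    "Materials": [f"Material {i}" for i in range(20)],
    "Energy": ["Solar energy", "Wind energy"],
}

print(find_flat_lists(hierarchy))  # {'Materials': 20}
```

Categories reported this way are candidates for an intermediate level of subcategories.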
  24. 24. Evaluation - Term style and Form Loanwords are fine if they are covered well within the content (habeas corpus) Abbreviations and acronyms should be spelled out, unless the proper name is rarely used (DNA) Do not include parentheses unless disambiguating the term • Mercury (element) = Okay • Computed tomography (CT) = Frowned upon
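The parentheses guideline lends itself to a simple automated check. The sketch below rests on a heuristic assumption that is not part of the guideline itself: a trailing parenthetical written entirely in uppercase is treated as an abbreviation, and anything else as a disambiguating qualifier. The function name and return labels are hypothetical.

```python
import re

# Matches a term followed by a single trailing parenthetical, e.g. "Mercury (element)".
PAREN = re.compile(r"^(?P<term>.+?)\s*\((?P<inner>[^()]+)\)$")


def check_parenthetical(term):
    """Classify a term's trailing parenthetical as 'none', 'qualifier', or 'abbreviation'."""
    match = PAREN.match(term.strip())
    if not match:
        return "none"
    inner = match.group("inner").strip()
    # Heuristic assumption: all-uppercase content is an abbreviation, not a qualifier.
    if inner.isupper():
        return "abbreviation"
    return "qualifier"


print(check_parenthetical("Mercury (element)"))         # qualifier
print(check_parenthetical("Computed tomography (CT)"))  # abbreviation
print(check_parenthetical("Distillation"))              # none
```

Terms flagged as "abbreviation" would be reviewed by hand; the abbreviation typically belongs in a synonym rather than in the preferred term.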
  25. 25. Indexing Post-coordination • Two or more thesaurus terms are applied to an article to represent a concept • Used at the time of search and retrieval • Examples: Liver AND Anatomy; New York AND Subway Pre-coordination • Terms are combined before indexing • Uses one node to describe content • Examples: Furniture-California-San Francisco-History-20th Century; Liver-Blood Vessels-Diseases-Congresses
  26. 26. Post-coordinated terms work more effectively for MAIstro (Thesaurus Master and M.A.I.) • Allows M.A.I. to easily identify subject terms within a range of documents without elaborate rules • Easier to maintain simpler vocabulary terms Pre-coordination allows an unlimited number of terms to be added to the Thesaurus • Expressing multiple concepts within a single thesaurus term will set a precedent for creating all terms in this manner • If you have the term Computers in chemistry, what will stop you from creating Computers in biology, Computers in dentistry, Computers in echolocation, etc.?
  27. 27. Evaluation - Term style and form Keep terms plural unless • Changing the term to a plural form alters the meaning of the term (e.g. Technology; Technologies) • If this is the case, disambiguate the concepts with parenthetical qualifiers: Technology (applied sciences) and Technologies (devices) • Literary warrant or User warrant dictates that the term be singular Control the vocabulary through use of synonyms • Terms must represent unique concepts Keep a single train of thought
  28. 28. Revision and Reiteration Thesaurus development is highly cyclical • When multiple people are involved, reviewing alternate sections and each other's work is highly recommended • A fresh pair of eyes will catch plenty of errors and inconsistencies within the thesaurus terms Subject Matter Expert feedback is always recommended • Must be clear what SMEs are reviewing and why they are reviewing it • Many experts are highly opinionated and unaware of the scope/implementation of the project • Feedback must be re-evaluated (sometimes taken with a grain of salt)
  29. 29. Standards and Compliance • American National Standards Institute / National Information Standards Organization • ANSI/NISO Z39.19 • British Standards Institution • BS 8723 parts 1-4 • International Organization for Standardization • ISO 25964
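Post-coordination at search time amounts to a Boolean AND over simple terms. The following sketch shows the idea with a toy inverted index; it is a generic illustration, not a description of how MAIstro or M.A.I. work internally, and the index contents and function name are hypothetical.

```python
def search_all(index, *terms):
    """Return the IDs of documents indexed with every one of the given terms (Boolean AND).

    `index` maps each thesaurus term to the set of document IDs tagged with it.
    """
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical index, invented for illustration.
index = {
    "Liver": {"doc1", "doc2", "doc4"},
    "Anatomy": {"doc2", "doc3", "doc4"},
    "Blood vessels": {"doc4"},
}

print(sorted(search_all(index, "Liver", "Anatomy")))                   # ['doc2', 'doc4']
print(sorted(search_all(index, "Liver", "Anatomy", "Blood vessels")))  # ['doc4']
```

Three simple terms cover every combination a searcher might need, whereas a pre-coordinated vocabulary would need a separate term for each combination.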
  30. 30. Continue on with the Live Demo