Centralized Taxonomy Management for Enterprise Information Systems


Daniela Barbosa, Synaptica Business Development Manager, Dow Jones Client Solutions, Dow Jones & Company
Paula R. McCoy, Manager, Taxonomy Development, ProQuest

Now that you have built your taxonomies, you need to manage and maintain them in a centralized environment that can be leveraged by all of your enterprise applications including search tools, portals, and CMS/DMS systems. This session will review some best practices in centralized taxonomy management and go through the implementation of a thesaurus management tool at ProQuest, which enabled them to create a common language to connect disparate information assets using large and varied vocabularies and authority files linked to new and existing editorial systems.

  1. 1. Centralized Taxonomy Management for Enterprise Information Systems
     Enterprise Search Summit, Wednesday, September 24th, 2:00 pm – 2:30 pm
     - Dow Jones Client Solutions, Synaptica: [email_address]
     - ProQuest, Manager, Taxonomy Development: [email_address]
  2. 2. Dow Jones Taxonomy Solutions
     - Words
       - Dow Jones taxonomy licensing
       - Other taxonomy licensing (Taxonomy Warehouse)
       - Taxonomy customization
       - Taxonomy development
     - Expertise
       - Taxonomy Assessment
       - Taxonomy Consulting
       - Analysis
       - Recommendations
       - Implementation
       - Workshops
     - Tools
       - Synaptica: Taxonomy / Metadata Management Tool
  3. 3. Some Definitions
     - A taxonomy is a hierarchical topic structure to which information can be assigned through the dual processes of classification (filing to a location) and categorisation (tagging with relevant metadata). A taxonomy provides browsable navigation and supports filtered searching.
     - A thesaurus is a controlled vocabulary linking an organisation’s common language to its taxonomy structure. It accommodates synonyms, acronyms, language variants and other near equivalences. It also signposts non-hierarchical linkages within and across the taxonomy facets. A thesaurus is usually employed to interpret and guide user search queries.
     - An ontology is the working model of entities and interactions in a particular domain of knowledge or content set. It is a set of concepts, such as things, events, and relations, that are specified in some way in order to create an agreed-upon vocabulary for exchanging information. An ontology is increasingly used to visualise (or map) a set of search results and discover new or hidden connections.
  4. 4. Three directions of relationship
     - UP / DOWN: Classic taxonomy … groups things or concepts into families
     - SIDEWAYS: Traditional thesaurus … captures the different names of the family members and explores some more distant associations (cousins & close friends)
     - MULTI-DIRECTIONAL: Emerging ontology … shows a network of multi-dimensional relationships and properties both within and outside the family groups
  5. 5. Example: Mobile Phones
     - UP / DOWN: Telephones is a broader term than Mobile Phones
     - SIDEWAYS: Mobile Phones are AKA Cell Phones & Hand Phones, and similar to Hand Held Devices & PDAs
     - MULTI-DIRECTIONAL: Mobile Phones are made by Phone Manufacturers and use the networks of Telecoms Service Providers
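The three directions in the Mobile Phones example can be held in one small record. The following is a minimal Python sketch, not taken from the presentation or from any Synaptica API: the `Concept` class, its field names, and the predicate names `made_by` and `uses_network_of` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One vocabulary concept (hypothetical structure for illustration)."""
    label: str
    broader: set = field(default_factory=set)    # UP (taxonomy)
    narrower: set = field(default_factory=set)   # DOWN (taxonomy)
    synonyms: set = field(default_factory=set)   # SIDEWAYS (thesaurus, "AKA")
    related: set = field(default_factory=set)    # SIDEWAYS (thesaurus, "similar to")
    # MULTI-DIRECTIONAL (ontology): named predicate -> set of target concepts
    properties: dict = field(default_factory=dict)


mobile = Concept(
    label="Mobile Phones",
    broader={"Telephones"},
    synonyms={"Cell Phones", "Hand Phones"},
    related={"Hand Held Devices", "PDAs"},
    properties={
        "made_by": {"Phone Manufacturers"},
        "uses_network_of": {"Telecoms Service Providers"},
    },
)


def describe(concept):
    """List every relationship as 'label CODE target', in a stable order."""
    lines = []
    for code, terms in (("BT", concept.broader), ("NT", concept.narrower),
                        ("UF", concept.synonyms), ("RT", concept.related)):
        for term in sorted(terms):
            lines.append(f"{concept.label} {code} {term}")
    for predicate in sorted(concept.properties):
        for term in sorted(concept.properties[predicate]):
            lines.append(f"{concept.label} {predicate} {term}")
    return lines


for line in describe(mobile):
    print(line)
```

The taxonomy and thesaurus links use a fixed set of relationship codes, while the ontology links are open-ended named predicates; that difference is what the slide calls "multi-directional".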
  6. 6. Metadata’s Evolutionary Path
     Diagram labels: Dictionaries & Flat Lists; Hierarchical Taxonomies; Controlled Vocabulary; Thesauri; Ontologies; Structured Authority Files
     Metadata is evolving organically: the less complex metadata elements form the building blocks for creating the more complex structures.
  7. 7. Practical Applications
     - Portal navigation and browsable website menus
     - Conceptual access to large databases
     - Records management and cataloging
     - e-Commerce online product catalogues
     - Inventory control and de-duplication
     - Auto-classification of internal documents and email
     - Multilingual search and browse
     - Metasearch of enterprise-wide resources
  8. 8. Centralized Taxonomy and Metadata Management
     As a centralized repository for multi-lingual semantic management that is:
     - Independent from systems like web-portal search and categorization systems
     - Scalable; capable of evolving with emerging corporate semantic standards
     Diagram labels: Centralized Taxonomy Management System (Synaptica®); multiple users working in collaborative and compartmentalized space; permissions; formats and interfaces: HTML, CSV, XML, ZThes, SKOS, OWL, Web Services; connected systems: Portals, Categorizers, Search Engines, Content
  9. 9. Why Centralized?
     - Metadata can transcend information islands and data silos, but only if the enterprise is committed to common standards
     - A centralized system that supports both collaboration and compartmentalization allows common metadata to be shared while also allowing user communities the independence to manage specialized metadata files
  10. 10. Why Independent?
     - Enterprises are increasingly making use of multiple proprietary and open source software tools for categorization, search and portal tasks
     - While many of these tools support some level of metadata management, the diversity of standards, data formats and business rules they support can actually exacerbate the data silo problem by creating metadata silos
  11. 11. Where taxonomy fits with Search
     Diagram labels: DMS; CMS; Shared Docs; News & Research Data; Search Engine; Taxonomy & Metadata Platform; Information Processing, Management and Storage
  12. 12. 4 Good Reasons for Taxonomy
     - Search Relevancy; Search Completeness; Search Federation; Search Visualisation
     - Effective Research/Risk Mitigation; Knowledge Worker Productivity; Discovery & Innovation; Better & Faster Decisions
  13. 13. 1. Improved Search Relevancy
     - Ambiguity of Language
       - Is a Blackberry a fruit or a handheld device?
     - By including this brand name in a taxonomy we can give context to the user search query
     - In a telecoms domain we can assume that the user means the latter and only return content tagged as such
     - Alternatively we can weight the results, promoting those documents about handheld devices above those that refer to the fruit
     - Either way the result is increased search precision, which translates into time savings
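The two options on the slide above (filter to the in-domain sense, or weight it higher) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SENSES` table, the sample documents, the `search` function, and the boost value of 2.0 are all hypothetical and do not describe any particular search engine.

```python
# Hypothetical: taxonomy concepts that an ambiguous query word can refer to.
SENSES = {"blackberry": {"Fruit", "Handheld Devices"}}

# Hypothetical documents, each tagged with taxonomy categories.
DOCS = [
    {"title": "Blackberry jam recipes", "tags": {"Fruit"}},
    {"title": "Blackberry handset review", "tags": {"Handheld Devices"}},
    {"title": "Blackberry sales figures", "tags": set()},
]


def search(query, domain_concepts, filter_only=False, boost=2.0):
    """Return matching titles, using the domain to resolve ambiguity.

    filter_only=True  -> only return content tagged with the in-domain sense
    filter_only=False -> return all matches, promoting in-domain content
    """
    word = query.lower()
    preferred_senses = SENSES.get(word, set()) & domain_concepts
    results = []
    for doc in DOCS:
        if word not in doc["title"].lower():
            continue
        in_context = bool(doc["tags"] & preferred_senses)
        if filter_only and not in_context:
            continue
        score = boost if in_context else 1.0
        results.append((score, doc["title"]))
    # Highest score first; ties broken alphabetically for a stable order.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [title for _, title in results]


telecoms = {"Handheld Devices", "Telephones"}
print(search("Blackberry", telecoms, filter_only=True))
print(search("Blackberry", telecoms))
```

Filtering gives the highest precision but hides untagged content; weighting keeps everything and only changes the order, which is the safer choice when tagging coverage is incomplete.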
  14. 14. 2. Improved Search Completeness
     - Synonymous and Related Term Relationships
       - Mobile Phone (PT) = Cell Phone (NPT) = Hand Phone (NPT)
       - Mobile Phone is related to Hand Held Device (RT)
     - User Search Query = “Cell Phones”
     - The taxonomy simultaneously broadens the search and prioritises the returned results, giving increased recall without compromising relevancy
     - Content tagged with the Mobile Phone category is promoted over content not tagged, using a weighting in the search algorithm
     - Content tagged with the Hand Held Device category may also receive a weighting
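The “Cell Phones” query above can be traced through a small sketch: map the non-preferred term to its preferred term, add the related term, and weight tagged content. The weights (2.0 for the preferred category, 1.5 for the related category, 1.0 for a plain text match), the plural handling, and the sample documents are assumptions chosen for illustration only.

```python
# Non-preferred terms (NPT) and the preferred term (PT) itself, lower-cased.
TO_PREFERRED = {
    "mobile phone": "Mobile Phone",
    "cell phone": "Mobile Phone",
    "hand phone": "Mobile Phone",
}
# Related-term (RT) links between categories.
RELATED = {"Mobile Phone": {"Hand Held Device"}}

DOCS = [
    {"id": "d1", "text": "New mobile phone tariffs", "tags": {"Mobile Phone"}},
    {"id": "d2", "text": "Cell phones banned in class", "tags": set()},
    {"id": "d3", "text": "PDA market shrinks", "tags": {"Hand Held Device"}},
    {"id": "d4", "text": "Fruit prices", "tags": {"Fruit"}},
]


def expand(query):
    """Map a query to weighted taxonomy categories (assumed weights)."""
    key = query.strip().lower()
    if key.endswith("s"):          # crude plural handling, for the sketch only
        key = key[:-1]
    preferred = TO_PREFERRED.get(key)
    if preferred is None:
        return {}
    weights = {preferred: 2.0}
    for related in RELATED.get(preferred, ()):
        weights[related] = 1.5
    return weights


def rank(query, docs):
    """Score documents by category weight plus a literal text match."""
    weights = expand(query)
    scored = []
    for doc in docs:
        tag_score = max((weights.get(t, 0.0) for t in doc["tags"]), default=0.0)
        text_score = 1.0 if query.lower() in doc["text"].lower() else 0.0
        score = tag_score + text_score
        if score > 0:
            scored.append((score, doc["id"]))
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


print(rank("Cell Phones", DOCS))
```

Documents d1 and d3 never mention "cell phones" in their text, yet the taxonomy brings them into the result set (recall), while the weighting keeps the most relevant category at the top.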
  15. 15. 3. Search federation and data integration
     - A snapshot or dashboard is often more desirable than a list of document titles or snippets, especially when looking for information on a customer, supplier or competitor
     - Also, information will most likely reside in a number of internal repositories, each with its own level of metadata structure
     - Taxonomy allows the combination of news, internal CI reports, price plans, coverage data, market share data, share price etc. in one consolidated view by providing mappings or cross-walks
     - This is essentially applying business intelligence discipline to the world of unstructured information
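A cross-walk, in its simplest form, is a lookup from each repository's local label to one shared taxonomy term. The sketch below is illustrative only: the repository names, the local labels, the company "Acme Corporation", and the function `consolidated_view` are all invented for the example.

```python
from collections import defaultdict

# Hypothetical cross-walk: (repository, local label) -> central taxonomy term.
CROSSWALK = {
    ("news", "Acme Corp."): "Acme Corporation",
    ("ci_reports", "ACME"): "Acme Corporation",
    ("pricing", "acme-telecom"): "Acme Corporation",
}

# Hypothetical records from three repositories with different local metadata.
RECORDS = [
    {"repo": "news", "label": "Acme Corp.", "item": "Acme launches new handset"},
    {"repo": "ci_reports", "label": "ACME", "item": "Q3 competitor profile"},
    {"repo": "pricing", "label": "acme-telecom", "item": "Price plan comparison"},
    {"repo": "news", "label": "Other Co", "item": "Unmapped story"},
]


def consolidated_view(records, crosswalk):
    """Group items from all repositories under their central taxonomy term."""
    view = defaultdict(dict)
    unmapped = []
    for rec in records:
        term = crosswalk.get((rec["repo"], rec["label"]))
        if term is None:
            unmapped.append(rec)   # candidates for new cross-walk entries
            continue
        view[term].setdefault(rec["repo"], []).append(rec["item"])
    return dict(view), unmapped


view, unmapped = consolidated_view(RECORDS, CROSSWALK)
for repo, items in view["Acme Corporation"].items():
    print(repo, items)
```

Keeping the unmapped records visible matters in practice: they show where the cross-walk needs to grow as new content and new local labels appear.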
  16. 16. 4. Search Visualisation
     - The previous three scenarios assume the user knows what they are looking for
     - But what about serendipitous discovery?
     - By being able to see across an aggregation of content and extract facts and relationships from deep within the information stores, true (and sometimes fortunate) discovery can take place
  17. 17. Diagram labels:
     - Layers: Back End Information Structure; Front End Information Intelligence
     - Components: Synaptica® Vocabulary & Metadata Management (Taxonomies, Thesauri, Ontologies); Document, Content & Records Management (Filing & Storage, Metadata Tagging (Categorisation) Process); Search Engine; Visualisation; Navigation; Intranet / Portal User Interface
     - People: Librarians, Taxonomists, Indexers, Knowledge & Information Managers; Information Creators, Records Managers, Content Managers, Librarians, Indexers; Information Users (the business, the public); CIOs, CTOs, IT Architects
  18. 18. Centralized Taxonomy Management for Enterprise Information Systems
     Paula R. McCoy, Manager, Taxonomy Development, ProQuest [email_address]
  19. 19. Topics of Discussion
     - Description of ProQuest Controlled Vocabulary & Authority Files
     - Taxonomy Management: Overview
     - Managing Terms Manually
     - Synaptica Thesaurus Management System
  20. 20. ProQuest
     - Subscription-based ProQuest® online information service available in academic and public libraries
     - Access to over 125 billion digital pages of content from magazine, trade, & scholarly publications, current & historical newspapers, original materials such as annual reports & civil war pamphlets, and daily wire feeds
  21. 21. Controlled Vocabulary and Authority Files
     - ProQuest Controlled Vocabulary (CV) used to index subjects; Authority Files used to index company, geographic, personal, and product names
     - CV applied to non-periodical & third-party content via mapping, to allow cross-searching of multiple DBs with one vocabulary
  22. 22. ProQuest Controlled Vocabulary
     - Created in 1970s for ABI/INFORM business database
     - Based on Library of Congress Subject Headings
     - Natural language, hierarchical vocabulary complying with ANSI/NISO Standard Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies)
  23. 23. ProQuest Controlled Vocabulary
     - Thesaurus subjects:
       - Business, economics & trade: 4,300 terms
       - Science, math & technology: 1,600 terms
       - Medicine: 1,150 terms
       - Humanities: 960 terms
       - Government & policy: 850 terms
       - Education: 400 terms
     - Merged with general reference vocabulary in 1980s
     - Major development effort in past 4 years to boost science, education & medical terms
  24. 24. ProQuest CV: Statistics
     - Preferred terms: 11,046
     - Non-preferred terms: 5,631
     - Scope Notes: 3,194 (29%)
     - Cross-references (Broader, Narrower, Related terms): 67,700
     - Terms added in 2007: 77
     - Terms added in 2008: 58+
  25. 25. Authority Files: Statistics
     - Corporate/Organization Names: 438,098 (names added in 2008: 5,489)
     - Personal Names: 416,239 (names added in 2008: 1,526)
     - Geographic (Location) Names: 34,331 (names added in 2008: 144)
     - Product Names: 38,210 (names added in 2008: 54)
  26. 26. The Taxonomy Manager’s Job
     - Add subject terms as dictated by new concepts and new content to index
     - Maintain hierarchies & Scope Notes
     - Load updated Thesaurus to ProQuest interface
     - Manage authority files to maintain standards & control file size
  27. 27. The Taxonomy Manager’s Job
     OBJECTIVE: To ensure that indexers and searchers alike have access to a complete and accurate Thesaurus that they can use to maximize the discoverability of documents in ProQuest
  28. 28. Sample Subject Term
     Chronic obstructive pulmonary disease
     SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow
        UF  COPD
        BT  Disease
        BT  Respiratory diseases
        NT  Asthma
        NT  Bronchitis
        NT  Emphysema
        RT  Airway management
        RT  Lungs
     Annotations:
     - Chronic obstructive pulmonary disease: preferred, or main term
     - SN: scope note defining term and how it is used
     - UF: non-preferred term; points to term used to index
     - BT: terms broader in nature to main term (COPD is a disease, and specifically, a respiratory disease)
     - NT: terms narrower in nature to main term (these are chronic lung diseases)
     - RT: terms related to main term that might be used to narrow the search
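The sample subject term maps naturally onto a small record type with one field per relationship code. This is a hedged sketch: the `TermRecord` class and its `render` method are invented for illustration and are not the Synaptica data model or export format.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """A thesaurus entry (hypothetical structure, not a vendor schema)."""
    preferred: str                                   # main term
    scope_note: str = ""                             # SN
    used_for: list = field(default_factory=list)     # UF: non-preferred terms
    broader: list = field(default_factory=list)      # BT
    narrower: list = field(default_factory=list)     # NT
    related: list = field(default_factory=list)      # RT

    def render(self):
        """Format the record in the layout shown on the slide."""
        lines = [self.preferred]
        if self.scope_note:
            lines.append(f"SN: {self.scope_note}")
        for code, terms in (("UF", self.used_for), ("BT", self.broader),
                            ("NT", self.narrower), ("RT", self.related)):
            lines.extend(f"   {code}  {term}" for term in terms)
        return "\n".join(lines)


copd = TermRecord(
    preferred="Chronic obstructive pulmonary disease",
    scope_note=("Any lung disease, such as chronic bronchitis or emphysema, "
                "causing obstruction of bronchial airflow"),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)

print(copd.render())
```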
  29. 29. Managing Terms Manually
     - New scientific content requiring a huge enhancement to vocabulary
     - Seven MS Word vocabulary documents, English and foreign language (French, German, Spanish), printed for internal use only
     - Six authority files & 3 vocabulary files in Oracle databases, requiring duplicate entry of subject terms in Word and Oracle
     - Legacy editorial system in process of being replaced
  30. 30. Thesaurus Management Systems Buying Criteria
     Thesaurus Management System: Requirements
     - Eliminate double entry
     - Improve editorial interface with vocabulary
     - Automate entry of reciprocal relationships
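"Automate entry of reciprocal relationships" means that entering one side of a link (for example BT) creates the inverse (NT) on the other term without a second manual entry. The sketch below shows the idea using the conventional thesaurus codes BT/NT, RT/RT and UF/USE; the `Thesaurus` class is a hypothetical illustration and does not describe how Synaptica implements the feature.

```python
# Each relationship code and the inverse that must exist on the target term.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


class Thesaurus:
    """Minimal in-memory thesaurus that keeps both sides of a link in sync."""

    def __init__(self):
        self.links = {}   # term -> set of (code, target term)

    def add(self, term, code, target):
        """Record 'term CODE target' and its reciprocal in one step."""
        if code not in RECIPROCAL:
            raise ValueError(f"unknown relationship code: {code}")
        self.links.setdefault(term, set()).add((code, target))
        self.links.setdefault(target, set()).add((RECIPROCAL[code], term))

    def relations(self, term):
        """Return the term's relationships in a stable, sorted order."""
        return sorted(self.links.get(term, set()))


thesaurus = Thesaurus()
main = "Chronic obstructive pulmonary disease"
thesaurus.add(main, "BT", "Respiratory diseases")
thesaurus.add(main, "UF", "COPD")
thesaurus.add(main, "RT", "Lungs")

print(thesaurus.relations("Respiratory diseases"))
print(thesaurus.relations("COPD"))
print(thesaurus.relations("Lungs"))
```

With a single entry point that writes both directions, the vocabulary cannot end up with a broader term that lacks the matching narrower term, which is the kind of inconsistency duplicate manual entry tends to produce.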
  31. 31. Life With Synaptica
     - Word: old, bad
     - Synaptica: new, good
  32. 32. Adding Terms Today: 3 Easy Steps
     1. Enter term and relationships into Synaptica “Item Details” window
     2. Export report of new terms into Word
     3. Send Word document to editors
  33. 33. Improving Thesaurus Management: Categories Feature
  34. 34. Subject Term Categories
  35. 35. CORP Names – Categories & Website
  36. 36. Foreign-Language Vocabularies: Language Equivalents
  37. 37. Foreign-Language Vocabularies: Life With Synaptica (screenshot labels: Spanish, German, French; alphabetical by language)
  38. 38. Synaptica Updates
     - Synaptica version 6.0 released in early 2006
     - Synaptica version 7.0 is being implemented now:
       - Enhanced user interface
       - Semantic Web standardization (RDF, OWL, SKOS) and Web Services integration
       - Expanded Reporting functionality
       - Enhanced adding and editing of term relationships, including “rapid-fire” simple drag-and-drop editing
       - Improved global term editing
       - Online help and user guides
  39. 39. Benefits of Synaptica
     - Greater awareness of thesaurus standards and terminology, e.g. “preferred” and “non-preferred” instead of Use and Used For
     - Long-needed updating and improvement in term hierarchies; ability to provide thesaurus statistics
     - Increase in Company name NPTs, from 1,935 to 8,952 today
     - Immediate responsiveness to indexer needs: real-time term additions, esp. NPTs and SNs
     - Easier loading of updated Thesaurus on PQ interface
  40. 40. thank you!