Centralized Taxonomy Management for Enterprise Information Systems
Daniela Barbosa, Synaptica Business Development Manager, Dow Jones Client Solutions, Dow Jones & Company
Paula R McCoy, Manager, Taxonomy Development, ProQuest
Now that you have built your taxonomies, you need to manage and maintain them in a centralized environment that can be leveraged by all of your enterprise applications including search tools, portals, and CMS/DMS systems. This session will review some best practices in centralized taxonomy management and go through the implementation of a thesaurus management tool at ProQuest, which enabled them to create a common language to connect disparate information assets using large and varied vocabularies and authority files linked to new and existing editorial systems.
Slideshow Transcript
- Slide 1: Centralized Taxonomy Management for
Enterprise Information Systems
Enterprise Search Summit
Wednesday, September 24th, 2:00 pm – 2:30 pm
Daniela Barbosa, Synaptica Business Development Manager, Dow Jones Client Solutions, daniela.barbosa@dowjones.com
Paula R. McCoy, Manager, Taxonomy Development, ProQuest, paula.mccoy@proquest.com
© Copyright 2007 Dow Jones and Company, Inc.
- Slide 2: Dow Jones Taxonomy Solutions
Words: Dow Jones taxonomy licensing; other taxonomy licensing (Taxonomy Warehouse); taxonomy customization; taxonomy development
Expertise: Taxonomy Assessment; Taxonomy Consulting (Analysis, Recommendations, Implementation); Workshops
Tools: Synaptica: Taxonomy / Metadata Management Tool
© Copyright 2007 Dow Jones and Company, Inc. Proprietary and Confidential |
- Slide 3: Some Definitions
A taxonomy is a hierarchical topic structure to which information can be assigned through the dual
processes of classification (filing to a location) and categorisation (tagging with relevant metadata). A
taxonomy provides browsable navigation and supports filtered searching
A thesaurus is a controlled vocabulary linking an organisation’s common language to its taxonomy
structure. It accommodates synonyms, acronyms, language variants and other near equivalences. It also
signposts non-hierarchical linkages within and across the taxonomy facets. A thesaurus is usually
employed to interpret and guide user search queries
An ontology is the working model of entities and interactions in a particular domain of knowledge or
content set. It is a set of concepts - such as things, events, and relations - that are specified in some way
in order to create an agreed-upon vocabulary for exchanging information. An ontology is increasingly used
to visualise (or map) a set of search results and discover new or hidden connections
- Slide 4: UP / DOWN, SIDEWAYS, Multi-Directional
UP / DOWN: Classic taxonomy… groups things or concepts into families
SIDEWAYS: Traditional thesaurus… captures the different names of the family members and explores some more distant associations (cousins & close friends)
Multi-Directional: Emerging ontology… shows a network of multi-dimensional relationships and properties both within and outside the family groups
- Slide 5: UP / DOWN, SIDEWAYS, Multi-Directional
UP / DOWN: Telephones is a broader term than Mobile Phones
SIDEWAYS: Mobile Phones are also known as Cell Phones & Hand Phones, and are similar to Hand Held Devices & PDAs
Multi-Directional: Mobile Phones are made by Phone Manufacturers and use the networks of Telecoms Service Providers
- Slide 6: Metadata’s Evolutionary Path
Dictionaries & Flat Lists → Structured Authority Files → Hierarchical Taxonomies → Controlled Vocabulary Thesauri → Ontologies
Metadata is evolving organically – the less complex metadata elements form the building blocks for creating the more complex structures
- Slide 7: Practical Applications
Portal navigation and browsable website menus
Conceptual access to large databases
Records management and cataloging
e-Commerce online product catalogues
Inventory control and de-duplication
Auto-classification of internal documents and email
Multilingual search and browse
Metasearch of enterprise-wide resources
- Slide 8: Centralized Taxonomy and Metadata Management
As a centralized repository for multi-lingual semantic management that is:
- Independent from systems like web-portal search and categorization systems
- Scalable; capable of evolving with emerging corporate semantic standards
[Diagram: multiple users working in a collaborative and compartmentalized space reach the Synaptica® Centralized Taxonomy Management System through Permissions; the system outputs HTML, CSV, XML, ZThes, SKOS, OWL and Web Services to Portals, Categorizers, Search Engines, Web Portals and Content Portals]
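Slide 8 lists SKOS among the formats a centralized system can feed to downstream tools. As a rough illustration only, the sketch below serialises a single concept as SKOS in Turtle syntax. It is not Synaptica's export; the `ex:` namespace, the concept identifiers and the function name are assumptions made for the example, and the labels are borrowed from the Mobile Phones example on Slide 5.

```python
# Minimal sketch of a SKOS (Turtle) export for one concept.
# Assumptions: the ex: namespace and concept ids are hypothetical, and
# labels are assumed to contain no double quotes (no escaping is done).
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix ex: <http://example.org/taxonomy/> .\n"
)

def to_skos_turtle(concept_id, pref_label, alt_labels=(), broader_ids=(), lang="en"):
    """Return a Turtle document describing one skos:Concept."""
    props = ["a skos:Concept", f'skos:prefLabel "{pref_label}"@{lang}']
    # Non-preferred terms become alternative labels.
    props += [f'skos:altLabel "{alt}"@{lang}' for alt in alt_labels]
    # Broader terms become links to other concepts.
    props += [f"skos:broader ex:{b}" for b in broader_ids]
    body = " ;\n    ".join(props)
    return f"{PREFIXES}\nex:{concept_id} {body} .\n"

turtle = to_skos_turtle(
    "mobilePhones",
    "Mobile Phones",
    alt_labels=["Cell Phones", "Hand Phones"],
    broader_ids=["telephones"],
)
print(turtle)
```

Because the vocabulary is held once and exported on demand, each consuming tool can be given the format it understands without maintaining its own copy of the terms.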
- Slide 9: Why Centralized?
Metadata can transcend information islands and
data silos but only if the enterprise is committed to
common standards
A centralized system that supports both
collaboration and compartmentalization allows
common metadata to be shared while also
allowing user communities the independence to
manage specialized metadata files
- Slide 10: Why Independent?
Enterprises are increasingly making use of
multiple proprietary and open source software
tools for categorization, search and portal tasks
While many of these tools support some level of
metadata management the diversity of standards,
data formats and business rules they support can
actually result in exacerbating the data silo
problem by creating metadata silos
- Slide 11: Where taxonomy fits with Search
[Diagram, top to bottom: Search Engine; Taxonomy & Metadata Platform; Information Processing, Management and Storage; sources: DMS, CMS, Shared Docs, Data, News & Research]
- Slide 12: 4 Good Reasons for Taxonomy
Search Relevancy → Knowledge Worker Productivity
Search Completeness → Effective Research/Risk Mitigation
Search Federation → Better & Faster Decisions
Search Visualisation → Discovery & Innovation
- Slide 13: 1. Improved Search Relevancy
Ambiguity of Language
Is a Blackberry a fruit or a handheld device?
By including this brand name in a taxonomy we
can give context to the user search query
In a telecoms domain we can assume that the user
means the latter and only return content tagged as
such
Alternatively we can weight the results, promoting
those documents about handheld devices above
those that refer to the fruit
Either way the result is increased search precision
which translates into time savings
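The two options on this slide, filtering by domain context or weighting the results, can be sketched as follows. This is a minimal illustration rather than any search engine's actual ranking: the documents, the category names and the boost value are all hypothetical.

```python
# Hypothetical tagged documents; the category names are illustrative only.
DOCS = [
    {"title": "Blackberry jam recipes", "tags": {"Fruit"}},
    {"title": "Blackberry handset review", "tags": {"Handheld devices"}},
    {"title": "Blackberry email outage", "tags": {"Handheld devices"}},
]

def filter_by_context(docs, query, context_tag):
    """Option 1: return only matching documents tagged with the domain's category."""
    q = query.lower()
    return [d for d in docs if q in d["title"].lower() and context_tag in d["tags"]]

def weight_by_context(docs, query, context_tag, boost=2.0):
    """Option 2: keep every match but rank in-context documents first."""
    q = query.lower()
    matches = [d for d in docs if q in d["title"].lower()]

    def score(doc):
        # The boost value is an assumption; a real engine would combine it
        # with its own relevance score.
        return boost if context_tag in doc["tags"] else 1.0

    # sorted() is stable, so documents with equal scores keep their input order.
    return sorted(matches, key=score, reverse=True)

for doc in weight_by_context(DOCS, "blackberry", "Handheld devices"):
    print(doc["title"])
```

Filtering gives the highest precision but hides the fruit documents entirely; weighting keeps them reachable at the bottom of the list.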
- Slide 14: 2. Improved Search Completeness
Synonymous and Related Term Relationships
Mobile Phone (PT) = Cell Phone (NPT) = Hand Phone
(NPT)
Mobile Phone is related to Hand Held Device (RT)
User Search Query = “Cell Phones”
The taxonomy simultaneously broadens the search and
prioritises the returned results giving increased recall
without compromising relevancy
Content tagged with the Mobile Phone category is promoted
over content not so tagged, using a weighting in the search
algorithm
Content tagged with the Hand Held Device category may also
receive a weighting
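The behaviour described on this slide can be sketched in a few lines: the query is resolved to its preferred term, expanded to every equivalent term, and documents are weighted by their category tags. The thesaurus entries come from the slide; the weight values, the function names and the naive plural handling are assumptions made for illustration.

```python
# Thesaurus entries taken from the slide: one preferred term (PT),
# its non-preferred terms (NPT), and one related term (RT).
THESAURUS = {
    "mobile phone": {
        "npt": ["cell phone", "hand phone"],
        "rt": ["hand held device"],
    }
}

# Weights are illustrative assumptions, not values from the presentation.
PT_WEIGHT = 2.0
RT_WEIGHT = 1.5

def normalise(term):
    """Lower-case and strip a trailing plural 's' (a deliberately naive rule)."""
    term = term.lower().strip()
    return term[:-1] if term.endswith("s") else term

def resolve_preferred(query):
    """Map a query to its preferred term, or None if the thesaurus lacks it."""
    q = normalise(query)
    for pt, entry in THESAURUS.items():
        if q == pt or q in entry["npt"]:
            return pt
    return None

def expand(query):
    """Return every term to search for: the PT plus all of its NPTs."""
    pt = resolve_preferred(query)
    if pt is None:
        return [normalise(query)]
    return [pt] + THESAURUS[pt]["npt"]

def score(doc_tags, query):
    """Weight a document by its category tags relative to the query's PT."""
    pt = resolve_preferred(query)
    if pt is None:
        return 1.0
    tags = {t.lower() for t in doc_tags}
    if pt in tags:
        return PT_WEIGHT
    if any(rt in tags for rt in THESAURUS[pt]["rt"]):
        return RT_WEIGHT
    return 1.0

print(expand("Cell Phones"))
print(score({"Hand Held Device"}, "Cell Phones"))
```

Expansion is what raises recall, since a search for "Cell Phones" also finds documents that only say "Mobile Phone". The weighting is what keeps relevancy from suffering as the result set grows.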
- Slide 15: 3. Search federation and data integration
A snapshot or dashboard is often more desirable
than a list of document titles or snippets,
especially when looking for information on a
customer, supplier or competitor
Also, information will most likely reside in a
number of internal repositories, each with their
own levels of metadata structure
Taxonomy allows the combination of news,
internal CI reports, price plans, coverage data,
market share data, share price etc. in one
consolidated view by providing mappings or
cross-walks
This is essentially applying business intelligence
discipline to the world of unstructured information
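A cross-walk of the kind mentioned above can be pictured as a lookup from each repository's local label to one shared taxonomy term. The sketch below is purely illustrative: the repository names, labels, company and records are all hypothetical, and a real mapping would live in the taxonomy management system rather than in code.

```python
from collections import defaultdict

# Hypothetical cross-walk: (repository, local label) -> shared taxonomy term.
CROSSWALK = {
    ("news", "Acme Corp."): "Acme Corporation",
    ("ci_reports", "ACME"): "Acme Corporation",
    ("market_data", "Acme Corporation Inc"): "Acme Corporation",
}

# Hypothetical records from three repositories, each with its own labelling.
RECORDS = [
    {"source": "news", "label": "Acme Corp.", "item": "Acme launches new price plan"},
    {"source": "ci_reports", "label": "ACME", "item": "Competitor briefing"},
    {"source": "market_data", "label": "Acme Corporation Inc", "item": "Market share figures"},
    {"source": "news", "label": "Unmapped Ltd", "item": "Unrelated story"},
]

def consolidated_view(records, crosswalk):
    """Group records from all repositories under their shared taxonomy term."""
    view = defaultdict(list)
    for rec in records:
        term = crosswalk.get((rec["source"], rec["label"]))
        if term is not None:  # records without a mapping are left out
            view[term].append((rec["source"], rec["item"]))
    return dict(view)

for term, items in consolidated_view(RECORDS, CROSSWALK).items():
    print(term, items)
```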
- Slide 16: 4. Search Visualisation
The previous three
scenarios assume the user
knows what they are
looking for
But what about
serendipitous discovery?
By being able to see across
an aggregation of content
and extract facts and
relationships from deep
within the information
stores, true (and
sometimes fortunate)
discovery can take place
- Slide 17: [Diagram]
Back End, Information Structure: Filing & Storage; Metadata Tagging (Categorisation); Document, Content & Records Management
Centre: Synaptica® Vocabulary & Metadata Management (Process), spanning Taxonomies, Thesauri and Ontologies
Front End, Information Intelligence: Search Engine; Navigation; Visualisation; Intranet / Portal User Interface
Roles shown around the diagram:
- CIOs; CTOs; IT Architects
- Librarians; Taxonomists; Indexers; Knowledge & Information Managers
- Information Creators; Records Managers; Content Managers; Librarians; Indexers
- Information Users (the business; the public)
- Slide 18: Centralized Taxonomy Management for
Enterprise Information Systems
Paula R. McCoy
Manager, Taxonomy Development
ProQuest
paula.mccoy@proquest.com
- Slide 19: Topics of Discussion
Description of ProQuest Controlled
Vocabulary & Authority Files
Taxonomy Management -- Overview
Managing Terms Manually
Synaptica Thesaurus Management System
- Slide 20: Access to over 125 billion digital pages of content from
magazine, trade, & scholarly publications, current &
historical newspapers, original materials such as annual
reports & civil war pamphlets, and daily wire feeds
Subscription-based ProQuest® online information
service available in academic and public libraries
- Slide 21: ProQuest Controlled Vocabulary used to index
subjects; Authority Files used to index
company, geographic, personal, product names
CV applied to non-periodical & third-party
content via mapping, to allow cross-searching
of multiple DBs with one vocabulary
- Slide 22: ProQuest Controlled Vocabulary
Natural language, hierarchical vocabulary complying with ANSI/NISO Standard Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies)
Created in 1970s for ABI/INFORM business database
Based on Library of Congress Subject Headings
- Slide 23: ProQuest Controlled Vocabulary
Merged with general reference vocabulary in 1980s
Major development effort in past 4 years to boost
science, education & medical terms
Thesaurus subjects:
Business, economics & trade – 4300 terms
Science, math & technology – 1600 terms
Medicine – 1150 terms
Humanities – 960 terms
Government & policy – 850 terms
Education – 400 terms
- Slide 24: ProQuest CV: Statistics
Preferred terms: 11,046
Non-preferred terms: 5631
Scope Notes: 3194 (29%)
Cross-references (Broader,
Narrower, Related terms): 67,700
Terms added in 2007: 77
Terms added in 2008: 58+
- Slide 25: Authority Files: Statistics
Corporate/Organization Names: 438,098
Names added in 2008: 5489
Personal Names: 416,239
Names added in 2008: 1526
Geographic (Location) Names: 34,331
Names added in 2008: 144
Product Names: 38,210
Names added in 2008: 54
- Slide 26: The Taxonomy Manager’s Job
Add subject terms as dictated by new
concepts and new content to index
Maintain hierarchies & Scope Notes
Load updated Thesaurus to ProQuest interface
Manage authority files to maintain standards & control file size
- Slide 27: The Taxonomy Manager’s Job
OBJECTIVE:
To ensure that indexers and searchers alike have access to
a complete and accurate Thesaurus that they can use to
maximize the discoverability of documents in ProQuest
- Slide 28: Sample Subject Term
Chronic obstructive pulmonary disease
SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow
UF COPD
BT Disease
BT Respiratory diseases
NT Asthma
NT Bronchitis
NT Emphysema
RT Airway management
RT Lungs
Annotations:
- First line: Preferred, or main term
- SN: Scope note defining term and how it is used
- UF: Non-preferred term: points to term used to index
- BT: Terms broader in nature to main term: COPD is a disease, and specifically, a respiratory disease
- NT: Terms narrower in nature to main term: these are chronic lung diseases
- RT: Terms related to main term that might be used to narrow the search
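The sample term on this slide maps naturally onto a small record type. The sketch below is one possible representation, not ProQuest's or Synaptica's data model; the class and field names are assumptions, while the values are copied from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    """One preferred term with the relationship types shown on the slide."""
    preferred: str
    scope_note: str = ""
    used_for: list = field(default_factory=list)   # UF: non-preferred terms
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT

    def render(self):
        """Format the record in the tag-per-line style used on the slide."""
        lines = [self.preferred]
        if self.scope_note:
            lines.append(f"SN: {self.scope_note}")
        for tag, values in (("UF", self.used_for), ("BT", self.broader),
                            ("NT", self.narrower), ("RT", self.related)):
            lines.extend(f"{tag} {value}" for value in values)
        return "\n".join(lines)

copd = Term(
    preferred="Chronic obstructive pulmonary disease",
    scope_note=("Any lung disease, such as chronic bronchitis or emphysema, "
                "causing obstruction of bronchial airflow"),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)
print(copd.render())
```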
- Slide 29: Managing Terms Manually
Seven MS Word vocabulary documents—
English and foreign language (French, German,
Spanish)—printed for internal use only
Six authority files & 3 vocabulary files in Oracle
databases, requiring duplicate entry of subject
terms in Word and Oracle
Legacy editorial system in process of being
replaced
New scientific content requiring a huge
enhancement to vocabulary
- Slide 30: Thesaurus Management Systems
Thesaurus Management System:
Buying Criteria
Requirements
Eliminate double entry
Automate entry of reciprocal relationships
Improve editorial interface with vocabulary
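Two of these requirements, eliminating double entry and automating reciprocal relationships, can be illustrated with a single store that writes the inverse link whenever an editor enters a relationship. This is a sketch of the general idea only and says nothing about how Synaptica implements it; the class and method names are hypothetical, and USE is taken here as the reciprocal of UF.

```python
from collections import defaultdict

# Each relationship type and its reciprocal; RT is symmetric.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}

class Thesaurus:
    """Single store of terms, so each relationship is entered only once."""

    def __init__(self):
        # term -> relationship type -> set of target terms
        self.links = defaultdict(lambda: defaultdict(set))

    def add(self, term, rel, target):
        """Record `term rel target` and its reciprocal in one step."""
        if rel not in RECIPROCAL:
            raise ValueError(f"unknown relationship type: {rel}")
        self.links[term][rel].add(target)
        self.links[target][RECIPROCAL[rel]].add(term)

    def get(self, term, rel):
        """Return the sorted targets of `rel` for `term` (empty if none)."""
        return sorted(self.links.get(term, {}).get(rel, ()))

thesaurus = Thesaurus()
thesaurus.add("Chronic obstructive pulmonary disease", "BT", "Respiratory diseases")
thesaurus.add("Chronic obstructive pulmonary disease", "UF", "COPD")

# The editor never typed these two links; they were created automatically.
print(thesaurus.get("Respiratory diseases", "NT"))
print(thesaurus.get("COPD", "USE"))
```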
- Slide 31: Life With Synaptica
Word – Old, Bad Synaptica – New, Good
- Slide 32: Adding Terms Today: 3 Easy Steps
1. Enter term and relationships into Synaptica
“Item Details” window
2. Export report of new terms into Word
3. Send Word document to editors
- Slide 33: Improving Thesaurus Management
Categories Feature
- Slide 34: Subject Term Categories
- Slide 35: CORP Names – Categories & Website
- Slide 36: Foreign-Language Vocabularies
Language
Equivalents
- Slide 37: Life With Synaptica
Foreign-Language Vocabularies
[Screenshot labels: Spanish; German; French; alphabetical by language]
- Slide 38: Synaptica Updates
Synaptica version 6.0 released in early 2006
Synaptica version 7.0 is being implemented now:
• Enhanced user interface
• Semantic Web standardization (RDF, OWL, SKOS) and
Web Services integration
• Expanded Reporting functionality
• Enhanced adding and editing of term relationships
including “rapid-fire” simple drag-and-drop editing
• Improved global term editing
• Online help and user guides
- Slide 39: Benefits of Synaptica
Greater awareness of thesaurus standards and
terminology, e.g.: “preferred” and “non-preferred”
instead of Use and Used For
Long-needed updating and improvement in term
hierarchies; ability to provide thesaurus statistics
Increase in Company name NPTs — from 1935 to
8952 today
Immediate responsiveness to indexer needs —
real-time term additions, esp. NPTs and SNs
Easier loading of updated Thesaurus on PQ
interface
- Slide 40: thank you!