Finding a Common Language: Bringing Complex and Disparate Vocabularies Together
This case study addresses the challenges ProQuest faced in managing multilingual controlled vocabularies spread across multiple Word documents and authority files maintained in an Oracle database. Speakers describe how implementing a thesaurus management tool helped ProQuest simplify and standardize its business semantic management, create a common language, and connect disparate information assets. They also cover handling large and varied vocabularies and authority files, linking new and existing editorial systems, enabling hierarchical views, and automating thesaurus management tasks.
Slideshow Transcript
- Slide 1: Finding a Common Language:
Bringing Complex and Disparate
Vocabularies Together
Paula R. McCoy
Manager, Taxonomy Development
ProQuest
paula.mccoy@proquest.com
- Slide 2: Part of Cambridge Information Group & CSA
Headquartered in Ann Arbor, Michigan
Editorial offices in Louisville, Kentucky
- Slide 3: Access to over 125 billion digital pages of content from
magazine, trade, & scholarly publications, current &
historical newspapers, original materials such as annual
reports & civil war pamphlets, and daily wire feeds
Subscription-based ProQuest® online information
service available in academic and public libraries
- Slide 4: Louisville editors abstract & index 4,000+
periodicals & newspapers
ProQuest Controlled Vocabulary used to index
subjects; Authority Files used to index
company, geographic, personal, product names
CV applied to non-periodical & third-party
content via mapping, to allow cross-searching
of multiple DBs with one vocabulary
- Slide 5: Topics of Discussion
Description of ProQuest Controlled
Vocabulary & Authority Files
Taxonomy Management -- Overview
Life Before Synaptica
Thesaurus Management System Purchase
Implementing Synaptica
Life With Synaptica
Q&A
- Slide 6: PQ CV
ProQuest Controlled Vocabulary
Natural language, hierarchical vocabulary complying with
ANSI/NISO Standard Z39.19 (Guidelines for the Construction,
Format, and Management of Monolingual Controlled Vocabularies)
Created in 1970s for ABI/INFORM business database
Based on Library of Congress Subject Headings
- Slide 7: PQ CV
ProQuest Controlled Vocabulary
Merged with general reference vocabulary in 1980s
Major development effort in past 4 years to boost
science, education & medical terms
Thesaurus subjects:
Business, economics & trade – 4300 terms
Science, math & technology – 1600 terms
Medicine – 1150 terms
Humanities – 960 terms
Government & policy – 850 terms
Education – 400 terms
- Slide 8: PQ CV
ProQuest CV: Statistics
Preferred terms: 11,046
Non-preferred terms: 5631
Scope Notes: 3194 (29%)
Cross-references (Broader,
Narrower, Related terms): 67,700
Terms added in 2007: 77
Terms added in 2008: 58+
- Slide 9: PQ CV
Authority Files: Statistics
Corporate/Organization Names: 438,098
Names added in 2008: 5489
Personal Names: 416,239
Names added in 2008: 1526
Geographic (Location) Names: 34,331
Names added in 2008: 144
Product Names: 38,210
Names added in 2008: 54
- Slide 10: Taxonomy Management
The Taxonomy Manager’s Job
Add subject terms as dictated by new
concepts & new content to index
Maintain hierarchies & Scope Notes
Load updated Thesaurus to ProQuest interface
Manage authority files to maintain standards
& control file size
- Slide 11: Taxonomy Management
The Taxonomy Manager’s Job
OBJECTIVE:
To ensure that indexers and searchers alike have access to
a complete and accurate Thesaurus that they can use to
maximize the discoverability of documents in ProQuest
- Slide 12: Taxonomy Management
Thesaurus on ProQuest®
- Slide 13: Taxonomy Management
Sample Subject Term
Chronic obstructive pulmonary disease (preferred, or main term)
SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow (scope note defining term and how it is used)
UF COPD (non-preferred term: points to term used to index)
BT Disease
BT Respiratory diseases (terms broader in nature to main term: COPD is a disease, and specifically, a respiratory disease)
NT Asthma
NT Bronchitis
NT Emphysema (terms narrower in nature to main term: these are chronic lung diseases)
RT Airway management
RT Lungs (terms related to main term that might be used to narrow the search)
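The sample entry above can be modeled as a small record. This is a minimal sketch, not ProQuest's or Synaptica's actual data model; the `Term` class and its field names are hypothetical and only mirror the SN/UF/BT/NT/RT tags shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term and its thesaurus relationships (hypothetical model)."""
    label: str
    scope_note: str = ""
    used_for: list[str] = field(default_factory=list)  # UF: non-preferred terms
    broader: list[str] = field(default_factory=list)   # BT
    narrower: list[str] = field(default_factory=list)  # NT
    related: list[str] = field(default_factory=list)   # RT

    def display(self) -> str:
        """Render the entry in the tagged layout used on the slide."""
        lines = [self.label]
        if self.scope_note:
            lines.append(f"SN: {self.scope_note}")
        for tag, values in (("UF", self.used_for), ("BT", self.broader),
                            ("NT", self.narrower), ("RT", self.related)):
            lines.extend(f"{tag} {value}" for value in values)
        return "\n".join(lines)


copd = Term(
    label="Chronic obstructive pulmonary disease",
    scope_note=("Any lung disease, such as chronic bronchitis or emphysema, "
                "causing obstruction of bronchial airflow"),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)
print(copd.display())
```

Keeping each relationship type in its own list makes it straightforward to count preferred terms, non-preferred terms, and cross-references separately.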
- Slide 14: Before Synaptica
Managing terms meant:
Multiple files Duplicate entries Errors
= less than ideal thesaurus management
- Slide 15: Before Synaptica
MS Word Document
Version 2004, ProQuest Controlled Vocabulary of Subject Terms, page 3
[Screenshot of a three-column Word page listing the following entries alphabetically, each with its scope note, classification codes, and cross-references: Academic degrees; Academic failure; Academic freedom; Academic grading; Academic guidance counseling; Academic libraries; Academic marketing; Academic probation; Academic standards; Academic underachievement; Academy awards; Acadian culture; Accelerated cost recovery system; Accelerated death benefits; Accelerated depreciation methods (USE: Depreciation methods)]
Key: SN=Scope note, CC=Classification code, UF=Use for,
BT=Broader term, NT=Narrower term, RT=Related term
- Slide 16: Before Synaptica
Vocabulary Documents in Word
ProQuest controlled vocabulary
French-language controlled vocabulary
German-language controlled vocabulary
Spanish-language controlled vocabulary
Combined PQ-CBCA controlled vocabulary
Ethnic database vocabulary, English
Ethnic database vocabulary, Spanish
- Slide 17: Before Synaptica
Oracle Database Forms
- Slide 18: Before Synaptica
Authority Files in Oracle
Class codes (related to subjects)
CORP names (391,665+ terms)
NAIC codes (related to companies)
GEOG names (32,000+ terms)
PERS names (350,000+ terms)
PROD names (38,000+ terms)
- Slide 19: Before Synaptica
Foreign-Language Vocabularies
French
German
Spanish
- Slide 20: Before Synaptica
Adding New Terms
1. Enter full term hierarchy into new Word doc
2. Copy term into main Word-based vocabulary &
enter reciprocal relationships
3. Enter term & relationships into Oracle
4. Review next-day report on Oracle activity
5. Send new term doc to editors via e-mail
6. Print new vocabulary (at least every two years)
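Step 2 above required typing each reciprocal relationship by hand, which is one place duplicate entries and errors can creep in. The sketch below shows how a tool could record both directions in one call. It is an illustration only, not how Synaptica or the Oracle forms work; `add_relationship` and the `RECIPROCAL` table are hypothetical names, and the sample terms come from the COPD entry on Slide 13.

```python
from collections import defaultdict

# Each relationship type and the type that must appear on the other term.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


def add_relationship(vocab, source, rel, target):
    """Record a relationship and its reciprocal in one step."""
    vocab[source][rel].add(target)
    vocab[target][RECIPROCAL[rel]].add(source)


# term -> relationship type -> set of target terms
vocab = defaultdict(lambda: defaultdict(set))
term = "Chronic obstructive pulmonary disease"
add_relationship(vocab, term, "UF", "COPD")
add_relationship(vocab, term, "BT", "Respiratory diseases")
add_relationship(vocab, term, "NT", "Emphysema")
add_relationship(vocab, term, "RT", "Lungs")

print(sorted(vocab["Respiratory diseases"]["NT"]))
# ['Chronic obstructive pulmonary disease']
```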
- Slide 21: TMS Purchase
Thesaurus Management Systems
- Slide 22: TMS Purchase
Buying Criteria
Up to 40 admin & 100 read-only users in multiple
locations
Ability to load vocabs from multiple Word docs &
Oracle authority files
Ability to add new vocabularies
Support for foreign language vocabularies
Vendor onsite installation & training
Software upgrades & tech support
- Slide 23: TMS Purchase
Buying Criteria
1. Ability to interact in real time with
editorial system
2. Ability to accommodate authority
files of 400,000+ names
- Slide 24: Implementing Synaptica
Implementing Synaptica
Contract signed and work begun in August 2004
PQ sent to Synaptica all the Word & Oracle files for
analysis
Decision points: how to load & structure data;
how to handle “suspect” or erroneous
relationships
- Slide 25: Implementing Synaptica
Synaptica Data Analysis
Relationship Validation Tests:
Term Uniqueness
Use Violations
Self-Referencing Relationships
One Relationship per Term Pair
Relationship Unique
Relationship Reciprocates
Circular References
Exception Reports delivered to PQ; Errors fixed before production
- Slide 26: Implementing Synaptica
Use Validation Error
Term in question: Marine resources

Marine ecology
SN: The ecology of the seas and oceans
UF: Benthic ecology
BT: Ecology
RT: Marine conservation
    Marine pollution
    Marine resources
    Oceans

Marine resources
USE: Underwater resources

Marine pollution
BT: Pollution
    Water pollution
RT: Marine conservation
    Marine ecology
    Ocean dumping
    Marine resources

Underwater resources
UF: Marine resources
BT: Natural resources
RT: Marine conservation
    Marine ecology
    Marine pollution
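In this example, "Marine resources" is a non-preferred term (USE: Underwater resources) yet still appears as an RT under two preferred terms, which appears to be the use violation being illustrated. The sketch below shows how three of the validation tests named on Slide 25 (Use Violations, Self-Referencing Relationships, Relationship Reciprocates) might be written. It is an assumption about what such checks do, not Synaptica's implementation; all function names are hypothetical, and only the relationships shown on this slide are loaded.

```python
# Each relationship type and the type expected on the target term.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}

# The four entries from this slide: term -> {relationship type: [targets]}.
vocab = {
    "Marine ecology": {
        "UF": ["Benthic ecology"],
        "BT": ["Ecology"],
        "RT": ["Marine conservation", "Marine pollution",
               "Marine resources", "Oceans"],
    },
    "Marine resources": {"USE": ["Underwater resources"]},
    "Marine pollution": {
        "BT": ["Pollution", "Water pollution"],
        "RT": ["Marine conservation", "Marine ecology",
               "Ocean dumping", "Marine resources"],
    },
    "Underwater resources": {
        "UF": ["Marine resources"],
        "BT": ["Natural resources"],
        "RT": ["Marine conservation", "Marine ecology", "Marine pollution"],
    },
}


def use_violations(vocab):
    """Find BT/NT/RT links that point at a non-preferred (USE) term."""
    non_preferred = {term for term, rels in vocab.items() if "USE" in rels}
    return [
        (term, rel, target)
        for term, rels in vocab.items()
        for rel in ("BT", "NT", "RT")
        for target in rels.get(rel, [])
        if target in non_preferred
    ]


def self_references(vocab):
    """Find terms that list themselves as a target."""
    return [
        (term, rel)
        for term, rels in vocab.items()
        for rel, targets in rels.items()
        if term in targets
    ]


def missing_reciprocals(vocab):
    """Find links whose target entry lacks the matching reverse link.

    Targets with no entry of their own in `vocab` are skipped.
    """
    missing = []
    for term, rels in vocab.items():
        for rel, targets in rels.items():
            for target in targets:
                if target in vocab and term not in vocab[target].get(RECIPROCAL[rel], []):
                    missing.append((term, rel, target))
    return missing


for violation in use_violations(vocab):
    print(violation)
# ('Marine ecology', 'RT', 'Marine resources')
# ('Marine pollution', 'RT', 'Marine resources')
```

On this sample, `missing_reciprocals` also reports that "Underwater resources" lists "Marine ecology" and "Marine pollution" as related terms without a link back. Replacing the two RT references to "Marine resources" with "Underwater resources" would clear both kinds of exception.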
- Slide 27: Implementing Synaptica
Foreign-Language Errors
Terms with no language equivalent (LEQ), i.e., no translation
In all 3 languages, multiple English terms with the same
translation, e.g.:

| English term | French term | French term (revised) |
|---|---|---|
| Purchasing | Achats | |
| Shopping | Achats | Shopping |
| Buyers | Acheteurs | |
| Purchasing agents | Acheteurs | Agents d'achat |
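Both kinds of foreign-language error can be detected mechanically. The following is a minimal sketch under the assumption that each vocabulary is available as (English term, translation) pairs; the function names are hypothetical, "Example term" is an invented placeholder, and the other rows come from the French example above.

```python
from collections import defaultdict


def missing_equivalents(pairs):
    """Return English terms that have no translation (no LEQ)."""
    return [english for english, translated in pairs if not translated]


def translation_collisions(pairs):
    """Return translations that are shared by more than one English term."""
    by_translation = defaultdict(list)
    for english, translated in pairs:
        if translated:
            by_translation[translated].append(english)
    return {t: terms for t, terms in by_translation.items() if len(terms) > 1}


french = [
    ("Purchasing", "Achats"),
    ("Shopping", "Achats"),
    ("Buyers", "Acheteurs"),
    ("Purchasing agents", "Acheteurs"),
    ("Example term", None),  # hypothetical row with no translation
]
french_revised = [
    ("Purchasing", "Achats"),
    ("Shopping", "Shopping"),
    ("Buyers", "Acheteurs"),
    ("Purchasing agents", "Agents d'achat"),
]

print(translation_collisions(french))
# {'Achats': ['Purchasing', 'Shopping'], 'Acheteurs': ['Buyers', 'Purchasing agents']}
print(translation_collisions(french_revised))
# {}
```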
- Slide 28: Implementing Synaptica
Final Challenge
Issue: Different editorial systems = 2x data
entry: once for Synaptica, once for Oracle
Solution: Overnight synchronization process to copy
Synaptica work into Oracle every night
Synch process discontinued April 2008
- Slide 29: Implementing Synaptica
Putting Synaptica Into Production
Nov 2004
Train users — provide documentation & hands-on
demonstrative training
Deal with people resistant to change
Encourage written feedback on system functionality
Send feedback to Synaptica – many of our suggestions
implemented in later versions
- Slide 30: Life With Synaptica
Life With Synaptica
Word – Old, Bad
Synaptica – New, Good
- Slide 31: Life With Synaptica
Adding Terms Today: 3 Easy Steps
1. Enter term and relationships into Synaptica
“Item Details” window
2. Export report of new terms into Word
3. Send Word document to editors
- Slide 32: Life With Synaptica
Synaptica Updates
Synaptica version 6.0 released in early 2006
Synaptica version 7.0 is being implemented now:
• Enhanced user interface
• Semantic Web standardization (RDF, OWL, SKOS) and
Web Services integration
• Expanded Reporting functionality
• Enhanced adding and editing of term relationships
including “rapid-fire” simple drag-and-drop editing
• Improved global term editing
• Online help and user guides
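To give a sense of what SKOS standardization means for a thesaurus entry, the sketch below writes the COPD term from Slide 13 as SKOS in Turtle syntax. It is an illustration only and does not reflect Synaptica's export format. The `pq:` namespace URI, the `slug` helper, and the `to_skos_turtle` function are hypothetical; the `skos:` properties are the standard W3C ones, with UF mapped to `skos:altLabel`, BT to `skos:broader`, NT to `skos:narrower`, and RT to `skos:related`.

```python
import re

PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix pq: <http://example.org/pqcv/> .\n"  # hypothetical namespace
)


def slug(label):
    """Turn a term label into a simple identifier for a prefixed name."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def literal(text):
    """Quote a string as an English-language Turtle literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@en'


def to_skos_turtle(label, scope_note="", used_for=(), broader=(),
                   narrower=(), related=()):
    """Render one preferred term as a skos:Concept in Turtle."""
    props = [("skos:prefLabel", literal(label))]
    if scope_note:
        props.append(("skos:scopeNote", literal(scope_note)))
    props += [("skos:altLabel", literal(x)) for x in used_for]
    props += [("skos:broader", f"pq:{slug(x)}") for x in broader]
    props += [("skos:narrower", f"pq:{slug(x)}") for x in narrower]
    props += [("skos:related", f"pq:{slug(x)}") for x in related]
    body = " ;\n".join(f"    {prop} {obj}" for prop, obj in props)
    return f"pq:{slug(label)} a skos:Concept ;\n{body} ."


ttl = to_skos_turtle(
    "Chronic obstructive pulmonary disease",
    scope_note=("Any lung disease, such as chronic bronchitis or emphysema, "
                "causing obstruction of bronchial airflow"),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)
print(PREFIXES)
print(ttl)
```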
- Slide 33: Life With Synaptica
Benefits of Synaptica
Greater awareness of thesaurus standards and
terminology, e.g.: “preferred” and “non-preferred”
instead of Use and Used For
Long-needed updating and improvement in term
hierarchies; ability to provide thesaurus statistics
Increase in Company name NPTs — from 1935 to
8952 today
Immediate responsiveness to indexer needs —
real-time term additions, esp. NPTs and SNs
Easier loading of updated Thesaurus on PQ
interface