Finding a Common Language: Bringing Complex and Disparate Vocabularies Together
  Paula R. McCoy, Manager, Taxonomy Development, ProQuest
  [email_address]
Part of Cambridge Information Group & CSA. Headquartered in Ann Arbor, Michigan. Editorial offices in Louisville, Kentucky.
Access to over 125 billion digital pages of content from magazine, trade, & scholarly publications; current & historical newspapers; original materials such as annual reports & Civil War pamphlets; and daily wire feeds. Subscription-based ProQuest® online information service available in academic and public libraries.
Louisville editors abstract & index 4,000+ periodicals & newspapers. ProQuest Controlled Vocabulary (CV) used to index subjects; Authority Files used to index company, geographic, personal, and product names. CV applied to non-periodical & third-party content via mapping, to allow cross-searching of multiple DBs with one vocabulary.
Topics of Discussion
  Description of ProQuest Controlled Vocabulary & Authority Files
  Taxonomy Management -- Overview
  Life Before Synaptica
  Thesaurus Management System Purchase
  Implementing Synaptica
  Life With Synaptica
  Q&A
ProQuest Controlled Vocabulary (PQ CV)
  Created in 1970s for ABI/INFORM business database
  Based on Library of Congress Subject Headings
  Natural language, hierarchical vocabulary complying with ANSI/NISO Standard Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies)
ProQuest Controlled Vocabulary
  Thesaurus subjects:
    Business, economics & trade – 4,300 terms
    Science, math & technology – 1,600 terms
    Medicine – 1,150 terms
    Humanities – 960 terms
    Government & policy – 850 terms
    Education – 400 terms
  Merged with general reference vocabulary in 1980s
  Major development effort in past 4 years to boost science, education & medical terms
ProQuest CV: Statistics
  Preferred terms: 11,046
  Non-preferred terms: 5,631
  Scope Notes: 3,194 (29%)
  Cross-references (Broader, Narrower, Related terms): 67,700
  Terms added in 2007: 77
  Terms added in 2008: 58+
Authority Files: Statistics
  Corporate/Organization Names: 438,098
    Names added in 2008: 5,489
  Personal Names: 416,239
    Names added in 2008: 1,526
  Geographic (Location) Names: 34,331
    Names added in 2008: 144
  Product Names: 38,210
    Names added in 2008: 54
Taxonomy Management
The Taxonomy Manager’s Job
  Add subject terms as dictated by new concepts & new content to index
  Maintain hierarchies & Scope Notes
  Load updated Thesaurus to ProQuest interface
  Manage authority files to maintain standards & control file size
The Taxonomy Manager’s Job
  Objective: To ensure that indexers and searchers alike have access to a complete and accurate Thesaurus that they can use to maximize the discoverability of documents in ProQuest
Thesaurus on ProQuest®
Sample Subject Term
  Chronic obstructive pulmonary disease
    SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow
    UF  COPD
    BT  Disease
    BT  Respiratory diseases
    NT  Asthma
    NT  Bronchitis
    NT  Emphysema
    RT  Airway management
    RT  Lungs
  Callouts:
    Term line: Preferred, or main term
    SN: Scope note defining term and how it is used
    UF: Non-preferred term; points to term used to index
    BT: Terms broader in nature to main term: COPD is a disease, and specifically, a respiratory disease
    NT: Terms narrower in nature to main term: these are chronic lung diseases
    RT: Terms related to main term that might be used to narrow the search
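The sample entry can also be read as a small record with one field per relationship type. The sketch below is only an illustration of that structure: the `Term` class, its field names, and the `render` helper are hypothetical and do not describe how ProQuest or Synaptica store terms.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term and its thesaurus fields (hypothetical structure)."""
    name: str
    scope_note: str = ""                                # SN
    used_for: list[str] = field(default_factory=list)   # UF: non-preferred terms
    broader: list[str] = field(default_factory=list)    # BT
    narrower: list[str] = field(default_factory=list)   # NT
    related: list[str] = field(default_factory=list)    # RT


def render(term: Term) -> str:
    """Print-style display: the term, then one line per tagged value."""
    lines = [term.name]
    if term.scope_note:
        lines.append(f"  SN: {term.scope_note}")
    tagged = (("UF", term.used_for), ("BT", term.broader),
              ("NT", term.narrower), ("RT", term.related))
    for tag, values in tagged:
        lines.extend(f"  {tag}  {value}" for value in values)
    return "\n".join(lines)


# Values copied from the sample slide.
copd = Term(
    name="Chronic obstructive pulmonary disease",
    scope_note=("Any lung disease, such as chronic bronchitis or emphysema, "
                "causing obstruction of bronchial airflow"),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)

print(render(copd))
```

Keeping each relationship type in its own list makes it straightforward to count cross-references or to check that every BT has a matching NT on the other term.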
Before Synaptica
  Managing terms meant:
    Multiple files
    Duplicate entries
    Errors
  = less than ideal thesaurus management
MS Word Document
  (Sample page: Version 2004, ProQuest Controlled Vocabulary of Subject Terms, page 3)
  Academic degrees
    SN: A title conferred on students upon graduating from a program of study at a college or university
    UF: Associates degree; Bachelors degree; Doctoral degree; Masters degree
    BT: Academic achievement
    RT: Colleges & universities; Graduate studies; Graduation requirements; Higher education; MBA programs & graduates
  Academic failure
    SN: The failure of a student to meet academic standards, including failure to be promoted or to graduate
    UF: Student failure
    RT: Academic achievement; Academic grading; Academic probation; Academic underachievement; At risk students; Grade repetition; Graduation requirements; School dropouts; Social promotion
  Academic freedom
    SN: Educators’ freedom to teach and research what they choose
    BT: Education
    RT: Colleges & universities; Curricula; Research; Teachers; Teaching
  Academic grading
    UF: Grading of students
    BT: Academic achievement
    RT: Academic failure; Academic probation; Achievement tests; Cheating; Education portfolios; Educational evaluation; Tests
  Academic guidance counseling
    UF: Guidance counseling; Student counseling
    BT: Counseling; Education
    RT: Career preparation; Counselor client relationships; Counselor education; School counseling
  Academic libraries
    UF: College libraries; School libraries
    BT: Libraries
    RT: Librarians; Library resources
  Academic marketing
    SN: Efforts of educational institutions to attract students and funding
    BT: Marketing
    NT: Student recruitment
    RT: Admissions policies; College admissions; College choice; Colleges & universities; Enrollment management; Enrollments
  Academic probation
    RT: Academic failure; Academic grading; Academic underachievement
  Academic standards
    SN: Standards for performance in defined academic areas set at the local, state, or federal levels
    BT: Standards
    RT: Academic achievement; Academic achievement gaps; Academic underachievement; Achievement tests; Core curriculum; Education policy; Educational evaluation; No Child Left Behind Act 2001-US; Quality of education; School effectiveness; Standardized tests
  Academic underachievement
    SN: Student performance that is below standards or below potential
    RT: Academic achievement; Academic achievement gaps; Academic failure; Academic standards; At risk students; Grade repetition; Social promotion
  Academy awards
    UF: Oscars (Motion picture awards)
    BT: Awards & honors; Motion picture industry
    RT: Actors
  Acadian culture
    UF: Cajuns
    BT: Minority & ethnic groups
  Accelerated cost recovery system
    CC: 4210
    UF: ACRS
    BT: Cost recovery; Depreciation; Depreciation methods
    NT: Modified accelerated cost recovery system
    RT: Capital cost recovery allowances; Declining balance method; Depreciable assets; Tax basis
  Accelerated death benefits
    CC: 4220
    CC: 8210
    UF: Living benefits; Viatical settlement
    BT: Death benefits
    RT: Estate planning; Hardship distributions; Insurance policies; Life insurance; Riders; Terminal illnesses
  Accelerated depreciation methods
    USE: Depreciation methods
  Key: SN=Scope note  CC=Classification code  UF=Use for  BT=Broader term  NT=Narrower term  RT=Related term
Vocabulary Documents in Word
  ProQuest controlled vocabulary
  French-language controlled vocabulary
  German-language controlled vocabulary
  Spanish-language controlled vocabulary
  Combined PQ-CBCA controlled vocabulary
  Ethnic database vocabulary, English
  Ethnic database vocabulary, Spanish
Oracle Database Forms
Authority Files in Oracle
  Class codes (related to subjects)
  CORP names (391,665+ terms)
  GEOG names (32,000+ terms)
  PERS names (350,000+ terms)
  PROD names (38,000+ terms)
  NAIC codes (related to companies)
Foreign-Language Vocabularies
  French
  German
  Spanish
Adding New Terms
  1. Enter full term hierarchy into new Word doc
  2. Copy term into main Word-based vocabulary & enter reciprocal relationships
  3. Enter term & relationships into Oracle
  4. Review next-day report on Oracle activity
  5. Send new term doc to editors via e-mail
  6. Print new vocabulary (at least every two years)
TMS Purchase
Thesaurus Management Systems
Buying Criteria
  Up to 40 admin & 100 read-only users in multiple locations
  Ability to load vocabs from multiple Word docs & Oracle authority files
  Support for foreign language vocabularies
  Ability to add new vocabularies
  Vendor onsite installation & training
  Software upgrades & tech support
Buying Criteria
  1. Ability to interact in real time with editorial system
  2. Ability to accommodate authority files of 400,000+ names
Implementing Synaptica
  Contract signed and work begun in August 2004
  PQ sent to Synaptica all the Word & Oracle files for analysis
  Decision points: how to load & structure data; how to handle “suspect” or erroneous relationships
Synaptica Data Analysis
  Relationship Validation Tests:
    Term Uniqueness
    Use Violations
    Self-Referencing Relationships
    One Relationship per Term Pair
    Relationship Unique
    Circular References
    Relationship Reciprocates
  Exception Reports delivered to PQ; Errors fixed before production
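To make a few of the validation tests above concrete, here is a minimal sketch of how such checks could be written. It is not Synaptica's implementation: the function names, the (source, type, target) triple layout, and the sample data are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical layout: each relationship is a (source, type, target) triple.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


def duplicate_terms(terms):
    """Term Uniqueness: terms that occur more than once, ignoring case."""
    counts = Counter(t.casefold() for t in terms)
    return sorted(t for t, n in counts.items() if n > 1)


def self_references(rels):
    """Self-Referencing Relationships: a term related to itself."""
    return [r for r in rels if r[0] == r[2]]


def multi_relationship_pairs(rels):
    """One Relationship per Term Pair: pairs linked by more than one type."""
    types = defaultdict(set)
    for source, rel_type, target in rels:
        types[(source, target)].add(rel_type)
    return sorted(pair for pair, found in types.items() if len(found) > 1)


def missing_reciprocals(rels):
    """Relationship Reciprocates: A BT B requires B NT A, and so on."""
    known = set(rels)
    return [(s, t, o) for s, t, o in rels
            if (o, RECIPROCAL[t], s) not in known]


def circular_terms(rels):
    """Circular References: terms that are their own ancestor via BT links."""
    parents = defaultdict(set)
    for source, rel_type, target in rels:
        if rel_type == "BT":
            parents[source].add(target)

    def reaches_itself(start):
        stack, seen = list(parents[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return False

    return sorted(t for t in list(parents) if reaches_itself(t))


# Made-up data containing one instance of each problem.
terms = ["Pollution", "Water pollution", "Marine pollution", "water pollution"]
rels = [
    ("Marine pollution", "BT", "Water pollution"),
    ("Water pollution", "NT", "Marine pollution"),
    ("Marine pollution", "RT", "Water pollution"),  # second type, same pair
    ("Water pollution", "RT", "Marine pollution"),
    ("Water pollution", "BT", "Pollution"),         # reciprocal NT missing
    ("Pollution", "BT", "Marine pollution"),        # closes a BT cycle
    ("Pollution", "RT", "Pollution"),               # self-reference
]

for check in (self_references, multi_relationship_pairs,
              missing_reciprocals, circular_terms):
    print(check.__name__, check(rels))
print("duplicate_terms", duplicate_terms(terms))
```

Each function returns the offending items rather than raising an error, which mirrors the idea of an exception report that an editor reviews and fixes before production.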
Use Validation Error
  Error: "Marine resources" is a non-preferred term (USE: Underwater resources), yet it still appears as an RT under Marine pollution and Marine ecology.
  Underwater resources
    UF: Marine resources
    BT: Natural resources
    RT: Marine conservation; Marine ecology; Marine pollution
  Marine pollution
    BT: Pollution; Water pollution
    RT: Marine conservation; Marine ecology; Ocean dumping; Marine resources
  Marine ecology
    SN: The ecology of the seas and oceans
    UF: Benthic ecology
    BT: Ecology
    RT: Marine conservation; Marine pollution; Marine resources; Oceans
  Marine resources
    USE: Underwater resources
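A use violation of this kind can be found mechanically: any relationship whose target is a non-preferred term should point at the preferred term instead. The sketch below uses the terms from this slide; the dictionary layout and the `use_violations` function are hypothetical, not Synaptica's own test.

```python
# USE references from the slide: non-preferred term -> preferred term.
use = {"Marine resources": "Underwater resources"}

# RT lists from the slide, keyed by preferred term.
related = {
    "Underwater resources": ["Marine conservation", "Marine ecology",
                             "Marine pollution"],
    "Marine pollution": ["Marine conservation", "Marine ecology",
                         "Ocean dumping", "Marine resources"],
    "Marine ecology": ["Marine conservation", "Marine pollution",
                       "Marine resources", "Oceans"],
}


def use_violations(related, use):
    """Return (term, bad_target, preferred_target) for each RT that
    points at a non-preferred term."""
    return [(term, target, use[target])
            for term, targets in related.items()
            for target in targets
            if target in use]


for term, bad, preferred in use_violations(related, use):
    print(f"{term}: RT '{bad}' should be '{preferred}'")
```

The third element of each result is the suggested replacement, so the same report could drive either a manual fix or an automated rewrite.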
Foreign-Language Errors
  Terms with no language equivalent (LEQ), e.g., no translation
  In all 3 languages, multiple English terms with the same translation, e.g.:

    English term         French term    French term (revised)
    Purchasing           Achats
    Shopping             Achats         Shopping
    Buyers               Acheteurs
    Purchasing agents    Acheteurs      Agents d'achat
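Both foreign-language problems lend themselves to a simple scan of the English-to-translation mapping. The following is a sketch under assumptions: the function names and the dictionary layout are invented for illustration, and the pairing of the two revised French terms with "Shopping" and "Purchasing agents" is inferred from the slide layout.

```python
from collections import defaultdict

# English term -> French equivalent, as shown on the slide before revision.
french = {
    "Purchasing": "Achats",
    "Shopping": "Achats",
    "Buyers": "Acheteurs",
    "Purchasing agents": "Acheteurs",
}


def shared_translations(leq):
    """Translations that are assigned to more than one English term."""
    by_translation = defaultdict(list)
    for english, translated in leq.items():
        by_translation[translated].append(english)
    return {t: names for t, names in by_translation.items() if len(names) > 1}


def missing_translations(english_terms, leq):
    """English terms with no language equivalent (LEQ) at all."""
    return [term for term in english_terms if not leq.get(term)]


print(shared_translations(french))

# Assumed revision: the two ambiguous terms get distinct French equivalents.
revised = {**french, "Shopping": "Shopping",
           "Purchasing agents": "Agents d'achat"}
print(shared_translations(revised))
print(missing_translations(["Purchasing", "Retailing"], revised))
```

Running the same scan per language would cover the French, German, and Spanish vocabularies with one routine.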
Final Challenge
  Issue: Different editorial systems = 2x data entry: once for Synaptica, once for Oracle
  Solution: Overnight synchronization process to copy Synaptica work into Oracle every night
  Synch process discontinued April 2008
Putting Synaptica Into Production (Nov 2004)
  Deal with people resistant to change
  Train users: provide documentation & hands-on demonstrative training
  Encourage written feedback on system functionality
  Send feedback to Synaptica; many of our suggestions implemented in later versions
Life With Synaptica
  Word – Old, Bad
  Synaptica – New, Good
Adding Terms Today: 3 Easy Steps
  1. Enter term and relationships into Synaptica “Item Details” window
  2. Export report of new terms into Word
  3. Send Word document to editors
Synaptica Updates
  Synaptica version 6.0 released in early 2006
  Synaptica version 7.0 is being implemented now:
    Enhanced user interface
    Semantic Web standardization (RDF, OWL, SKOS) and Web Services integration
    Expanded Reporting functionality
    Enhanced adding and editing of term relationships, including “rapid-fire” simple drag-and-drop editing
    Improved global term editing
    Online help and user guides
Benefits of Synaptica
  Greater awareness of thesaurus standards and terminology, e.g.: “preferred” and “non-preferred” instead of Use and Used For
  Long-needed updating and improvement in term hierarchies; ability to provide thesaurus statistics
  Increase in Company name NPTs: from 1,935 to 8,952 today
  Immediate responsiveness to indexer needs: real-time term additions, esp. NPTs and SNs
  Easier loading of updated Thesaurus on PQ interface
