  1. Seamless Searching of Numeric and Textual Resources. Michael Buckland, School of Information Management and Systems, University of California, Berkeley
  2. The Significance of Vocabulary • An economic claim: Vocabulary problems reduce the benefits of, and the return on investment in, information services. • Vocabulary is used for indexicality, so issues of identity are central to LIS. • Vocabulary is central to digital libraries. • Vocabulary is central to explaining the history of conceptions of LIS.
  3. God --- Knowableness --- History of doctrines --- Early church, ca. 30-600 --- Congresses.
  4. Economic Rationale: • Massive investment in repositories • Large investment in categorization schemes: classifications, thesauri, concept codes, headings, … • Categorization schemes are usually specialized and stylized • Schemes are increasingly unfamiliar to searchers, hence ineffective and inefficient use
  5. Remedy: Support for searching unfamiliar metadata vocabularies. An interface translates the searcher’s vocabulary into the system’s vocabulary.
  6. Examples: Automobile import and export data (Census Bureau). Automobiles? No data. Cars? “Railway or tramway stock.” (Passenger motor vehicles, spark ignition engine.)
  7. TL 205 in Library of Congress Classification; 180/280 in U.S. Patent Classification; 3711 in Standard Industrial Classification
  8. Example: Coastal pollution. F SU COASTAL POLLUTION: 0. F TW COASTAL POLLUTION, then SUMMARIZE SUBJECTS. LCSH: Marine pollution; Coastal zone management; Water --- Pollution; Petroleum industry and trade; Beach erosion; Coasts; Barrier islands. MeSH: Seawater; Water pollution; Bacteria; Water microbiology; Air pollution; Environmental monitoring; Bathing beaches.
  9. International Harmonized Commodity Classification System: “Computer” • HS 84: “Nuclear reactors, boilers, machines and mechanical appliances” • HS 8471: “Automatic data processing machines and units thereof, magnetic or optical readers, machines for transcribing data” • HS 847120: “Digital auto data proc mach contng in the same housing a CPU and input & output device”
  10. INSPEC Thesaurus subdomain-based indexes: • “Water” subdomain: Fission reactor safety; Fission reactor fuel; Polymers; Organic insulating materials; Water supply; Cable insulation; Insulation testing; and Insulating oils. • “Biology” subdomain: Water; Biomechanics; Physiological models; Neurophysiology; Cellular effects of radiation. • “Information Studies” subdomain: Agriculture; Natural resources; Forecasting theory; Operations research; Erosion.
  11. Example: Vietnam War. U.C. MELVYL Online Catalog. FIND XSU VIETNAM WAR: Search Results: 0 records. FIND XSU VIETNAMESE CONFLICT: Search Results: 4,190 records.
  12. Emanuel Goldberg: Aerial photography using a “Drachen.” Example: Tethered balloons. English: Aerostat. German: Drachen (= “kite” in the dictionary).
  13. “Entry vocabulary” search interfaces: • Software and algorithms map natural language vocabulary to specialized metadata terms. • Allow users to enter ordinary language queries while taking advantage of existing subject headings and categorization schemes. • Use co-occurrence statistics to link users’ ordinary language terms to system vocabularies. • Rely on the statistical association between lexical items in titles and abstracts and the system’s metadata vocabulary. • Suggest the most likely system vocabulary terms.
  14. Thesaurus navigation: • Facilitates browsing where structure is present: broader, narrower, and related terms • Guides the searcher to other parts of the structure. Retrieval set analysis: • Navigation within a micro-domain
  15. Web access: WWW forms-based application supported by Perl. Supports searches on remote repositories. Four subdomain dictionaries in three databases: --- BIOSIS (Biological Abstracts): subdomain “water” --- INSPEC: subdomains “information science”, “water” --- U.S. Patent Office classification
  16. Statement of work: • Varied prototype Entry Vocabulary Modules (EVMs). • Unintrusive development of EVMs by agents. • Sensitivity to subdomains. • Natural language processing to augment statistical term frequency. • Recommendations for metadata “codebooks” for numeric databases.
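
The “entry vocabulary” slide describes linking a searcher's ordinary words to system headings through co-occurrence statistics over titles and abstracts. The slides do not say which statistic is used, so the following is only a minimal sketch under stated assumptions: it scores each heading by the fraction of records containing a query word that also carry that heading. The record titles, the stopword list, and the function names (`tokenize`, `build_dictionary`, `suggest`) are all hypothetical; the headings are borrowed from the coastal pollution example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical training records: (title, assigned subject headings).
# The titles are invented; the headings echo the coastal pollution slide.
RECORDS = [
    ("Oil spills and coastal pollution in the Gulf",
     ["Marine pollution", "Petroleum industry and trade"]),
    ("Managing coastal pollution from urban runoff",
     ["Marine pollution", "Coastal zone management"]),
    ("Pollution of rivers by farm waste",
     ["Water --- Pollution"]),
    ("Coastal dunes and beach loss",
     ["Beach erosion", "Coasts"]),
]

# Assumed stopword list, kept tiny for illustration.
STOPWORDS = {"a", "and", "by", "from", "in", "of", "the"}


def tokenize(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_dictionary(records):
    """Count, per word, how many records contain it and which headings they carry."""
    word_docs = Counter()          # word -> number of records containing it
    cooc = defaultdict(Counter)    # word -> heading -> number of shared records
    for title, headings in records:
        for word in set(tokenize(title)):
            word_docs[word] += 1
            for heading in set(headings):
                cooc[word][heading] += 1
    return word_docs, cooc


def suggest(query, word_docs, cooc, top_n=3):
    """Rank headings by summed conditional frequency over the query words.

    This scoring rule is a stand-in, not the statistic used by the project.
    """
    scores = Counter()
    for word in set(tokenize(query)):
        for heading, shared in cooc.get(word, {}).items():
            scores[heading] += shared / word_docs[word]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


word_docs, cooc = build_dictionary(RECORDS)
for heading, score in suggest("coastal pollution", word_docs, cooc):
    print(f"{score:.2f}  {heading}")
```

With these toy records, the query “coastal pollution” ranks “Marine pollution” first even though no record uses the query phrase as a heading, which is the behaviour the slides attribute to an entry vocabulary module. A real module would be trained on a large set of indexed records from one subdomain and would likely use a more robust association measure.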