The Significance of Vocabulary
School of Information Management and Systems
University of California, Berkeley
• An economic claim: Vocabulary problems reduce
the benefits and return on investment in
• Vocabulary is used for indexicality, therefore
issues of identity are central to LIS.
• Vocabulary is central to digital libraries.
• Vocabulary is central to explaining the history of
conceptions of LIS!
A correctly formed Library of Congress Subject
heading, but who would think of such search
God --- Knowableness --- History of doctrines ---
Early church, ca. 30-600 --- Congresses.
• Massive investment in repositories
• Large investment in categorization schemes:
classifications, thesauri, concept codes, headings, …
• Categorization schemes usually specialized and
• Increasingly unfamiliar to searchers, hence ineffective,
Support for searching unfamiliar metadata
vocabularies: Interface to translate searcher’s
vocabulary into system’s vocabulary.
Automobile import, export data (Census Bureau)
“Railway or tramway stock”
(Passenger motor vehicles, spark ignition engine.)
“Automobiles”, also known as . . .
in Library of Congress Classification
in U.S. Patent Classification
in Standard Industrial Classification
Example: Coastal pollution
F SU COASTAL POLLUTION 0
F TW COASTAL POLLUTION
Coastal zone management
Water --- Pollution
Petroleum industry and trade
International Harmonized Commodity
Classification System: “Computer”
• HS 84: “Nuclear reactors, boilers, machines and
• HS 8471: “Automatic data processing machines
and units thereof, magnetic or optical readers,
machines for transcribing data”
• HS 847120: “Digital auto data proc mach contng
in the same housing a CPU and input & output
Example: Vietnam War.
U.C. MELVYL Online Catalog
FIND XSU VIETNAM WAR
Search Results: 0 records
FIND XSU VIETNAMESE CONFLICT
Search Results: 4,190 records
Dictionaries don’t always help
Emanuel Goldberg: Aerial photography using
Actual meaning: Aerodynamic tethered balloon.
Standard contemporary English was: Aerostat.
German: Drachen (= Kite in dictionary)
“Entry vocabulary” search interfaces:
• Software and algorithms map natural language
vocabulary to specialized metadata terms.
• Allows users to enter ordinary language queries while
taking advantage of existing subject headings,
• Uses co-occurrence statistics to link users’ ordinary
language terms to system vocabularies
• Statistical association between lexical items in titles and
abstracts and the system’s metadata vocabulary
• Suggests most likely system vocabulary
• Facilitates browsing where structure is
present: Broader, narrower, related terms
• Guides searcher to other parts of the
Retrieval set analysis
• Navigation within micro-domain
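The co-occurrence idea behind an entry vocabulary interface can be sketched in a few lines. This is a minimal illustration, not the project's actual algorithm: the training titles are hypothetical, the scoring rule (the share of a word's records that carry a given heading, summed over the query words) is an assumed stand-in for whatever association statistic is really used, and the names `build_dictionary` and `suggest` are invented for the example. Only the three subject headings come from the coastal pollution example.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (title text, subject headings assigned by indexers).
# The titles are invented; the headings are those shown in the coastal pollution example.
RECORDS = [
    ("Oil spills and coastal pollution",
     ["Water --- Pollution", "Coastal zone management"]),
    ("Managing coastal development", ["Coastal zone management"]),
    ("Pollution of rivers and lakes", ["Water --- Pollution"]),
    ("Oil markets and trade", ["Petroleum industry and trade"]),
]

STOPWORDS = {"and", "of", "the", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop a few stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def build_dictionary(records):
    """Count, per word, how many records contain it and which headings co-occur."""
    word_counts = Counter()
    pair_counts = defaultdict(Counter)
    for title, headings in records:
        for word in set(tokenize(title)):
            word_counts[word] += 1
            for heading in headings:
                pair_counts[word][heading] += 1
    return word_counts, pair_counts


def suggest(query, word_counts, pair_counts, top_n=3):
    """Rank headings by summed co-occurrence share (an assumed, simple score)."""
    scores = Counter()
    for word in tokenize(query):
        if word not in word_counts:
            continue  # word never seen in training titles
        for heading, n in pair_counts[word].items():
            scores[heading] += n / word_counts[word]
    return [heading for heading, _ in scores.most_common(top_n)]


word_counts, pair_counts = build_dictionary(RECORDS)
for heading in suggest("oil pollution", word_counts, pair_counts):
    print(heading)
```

The searcher types ordinary words ("oil pollution") and receives a ranked list of the system's own headings to search with. A production dictionary would be trained on a large set of indexed records and would likely use a more robust association measure than this simple ratio.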
Web access: WWW forms-based application
supported by Perl
Supports searches on remote repositories
Four subdomain dictionaries in three databases
--- BIOSIS (Biological abstracts): subdomain
--- INSPEC: subdomains: “information
--- U.S. Patent Office classification
Statement of work:
• Varied prototype Entry Vocabulary Modules.
• Unintrusive development of EVMs by agents
• Sensitivity to subdomains.
• Natural language processing to augment
statistical term frequency.
• Recommendations for metadata “codebooks”
for numeric databases.