Seamless Searching of Numeric and
School of Information Management and Systems
University of California, Berkeley
The Significance of Vocabulary
• An economic claim: Vocabulary problems reduce
the benefits and return on investment in
• Vocabulary is used for indexicality; therefore
issues of identity are central to LIS.
• Vocabulary is central to digital libraries.
• Vocabulary is central to explaining the history of
conceptions of LIS!
God --- Knowableness --- History of doctrines ---
Early church, ca. 30-600 --- Congresses.
• Massive investment in repositories
• Large investment in categorization schemes:
classifications, thesauri, concept codes, headings, …
• Categorization schemes usually specialized and
• Increasingly unfamiliar to searchers, hence ineffective,
Support for searching unfamiliar metadata
vocabularies: Interface to translate searcher’s
vocabulary into system’s vocabulary.
Automobile import, export data (Census Bureau)
“Railway or tramway stock”
(Passenger motor vehicles, spark ignition engine.)
in Library of Congress Classification
in U.S. Patent Classification
in Standard Industrial Classification
Example: Coastal pollution
F SU COASTAL POLLUTION 0
F TW COASTAL POLLUTION
Coastal zone management
Water --- Pollution
Petroleum industry and trade
International Harmonized Commodity
Classification System: “Computer”
• HS 84: “Nuclear reactors, boilers, machines and
• HS 8471: “Automatic data processing machines
and units thereof, magnetic or optical readers,
machines for transcribing data”
• HS 847120: “Digital auto data proc mach contng
in the same housing a CPU and input & output
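The three codes above nest by prefix: in the Harmonized System a 2-digit chapter contains 4-digit headings, which contain 6-digit subheadings. A minimal Python sketch of that nesting follows. The function names `hs_levels` and `is_within` are hypothetical and are not part of any Census Bureau or HS tool.

```python
def hs_levels(code: str) -> list[str]:
    """Return the chapter, heading, and subheading prefixes of an HS code.

    Assumes the code is written as 2, 4, or 6 digits, optionally with
    dots or spaces between the digit groups.
    """
    digits = code.replace(".", "").replace(" ", "")
    if not digits.isdigit() or len(digits) not in (2, 4, 6):
        raise ValueError(f"not a 2-, 4-, or 6-digit HS code: {code!r}")
    # Each broader level is simply a shorter prefix of the same code.
    return [digits[:n] for n in (2, 4, 6) if n <= len(digits)]


def is_within(code: str, broader: str) -> bool:
    """True if `code` falls under the broader HS code `broader`."""
    return hs_levels(broader)[-1] in hs_levels(code)


if __name__ == "__main__":
    print(hs_levels("847120"))        # chapter, heading, subheading
    print(is_within("847120", "84"))  # is the subheading inside chapter 84?
```

A searcher who thinks "computer" has no way to guess any of these prefixes, which is the gap an entry vocabulary is meant to close.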
Example: Vietnam War.
U.C. MELVYL Online Catalog
FIND XSU VIETNAM WAR
Search Results: 0 records
FIND XSU VIETNAMESE CONFLICT
Search Results: 4,190 records
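The translation step in this example can be sketched as a lookup from the searcher's phrase to the catalog's heading. The table below is hand-built and hypothetical, holding only the one pair shown on the slide. The names `ENTRY_VOCABULARY` and `to_system_query` are illustrative and not part of MELVYL.

```python
# Hypothetical entry-vocabulary table: searcher's phrase -> system headings.
# Only the pair from the slide is included; a real table would be far larger.
ENTRY_VOCABULARY = {
    "vietnam war": ["VIETNAMESE CONFLICT"],
}


def normalize(query: str) -> str:
    """Lower-case the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def to_system_query(query: str, index: str = "XSU") -> list[str]:
    """Rewrite a searcher's phrase as FIND commands using system headings.

    Returns an empty list when the phrase is not in the table.
    """
    headings = ENTRY_VOCABULARY.get(normalize(query), [])
    return [f"FIND {index} {heading}" for heading in headings]


if __name__ == "__main__":
    print(to_system_query("Vietnam War"))
```

A hand-built table does not scale, which motivates deriving the associations statistically from existing indexed records.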
Emanuel Goldberg: Aerial photography
using a “Drachen”
Example: Tethered balloons.
German: Drachen (= Kite in dictionary)
“Entry vocabulary” search interfaces:
• Software and algorithms map natural language
vocabulary to specialized metadata terms.
• Allows users to enter ordinary language queries while
taking advantage of existing subject headings,
• Uses co-occurrence statistics to link users’ ordinary
language terms to system vocabularies
• Statistical association between lexical items in titles and
abstracts and the system’s metadata vocabulary
• Suggests most likely system vocabulary
• Facilitates browsing where structure is
present: Broader, narrower, related terms
• Guides searcher to other parts of the
Retrieval set analysis
• Navigation within micro-domain
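The co-occurrence idea in the entry-vocabulary bullets above (associating words in titles and abstracts with the metadata terms assigned to the same records) can be sketched as follows. The slides do not say which association statistic is used, so this sketch substitutes a simple conditional frequency, count(word, heading) / count(word), as a stand-in. The training records are invented toy data, and `train` and `suggest` are hypothetical names.

```python
from collections import Counter, defaultdict

# Invented toy records: (title text, headings assigned by an indexer).
TOY_RECORDS = [
    ("oil spills on the coast",
     ["Water --- Pollution", "Petroleum industry and trade"]),
    ("managing coastal development",
     ["Coastal zone management"]),
    ("coastal water pollution surveys",
     ["Water --- Pollution", "Coastal zone management"]),
]


def train(records):
    """Count how often each word occurs, and with which headings."""
    word_counts = Counter()
    pair_counts = defaultdict(Counter)
    for text, headings in records:
        for word in set(text.lower().split()):
            word_counts[word] += 1
            for heading in headings:
                pair_counts[word][heading] += 1
    return word_counts, pair_counts


def suggest(query, word_counts, pair_counts, top_n=3):
    """Rank headings by summed count(word, heading) / count(word)."""
    scores = Counter()
    for word in set(query.lower().split()):
        for heading, n in pair_counts.get(word, {}).items():
            scores[heading] += n / word_counts[word]
    # Sort by descending score, then alphabetically for stable ties.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [heading for heading, _ in ranked[:top_n]]


if __name__ == "__main__":
    wc, pc = train(TOY_RECORDS)
    print(suggest("coastal pollution", wc, pc))
```

With real data the training records would come from an existing indexed database, and a more robust association measure would likely replace the raw ratio used here.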
Web access: WWW forms-based application
supported by Perl
Supports searches on remote repositories
Four subdomain dictionaries in three databases
--- BIOSIS (Biological abstracts): subdomain
--- INSPEC: subdomains: “information
--- U.S. Patent Office classification
Statement of work:
• Varied prototype Entry Vocabulary Modules.
• Unintrusive development of EVMs by agents
• Sensitivity to subdomains.
• Natural language processing to augment
statistical term frequency.
• Recommendations for metadata “codebooks”
for numeric databases.