October 2005 Members Council Meeting
Group Name: Research and New Technologies Interest Group
Prepared and submitted by: Sharon Bosarge
Have the minutes been reviewed by the group chair? Yes X No
Day 1 Attendees:
Karen Boehning, WILS (Facilitator)
Lynne Siemers, OCLC CAPCON
Jennifer Younger, INCOLSA
Migell Acosta, OCLC Western
Gregg Silvis, PALINET
Deb Carver, OCLC Western
John Ulmschneider, SOLINET
Eleanor Frierson, FEDLINK
Shirley Baker, MLNC
Lorcan Dempsey, OCLC
Diane Vizine-Goetz, OCLC
Brian Lavoie, OCLC
Bob Bolander, OCLC
Joan Mitchell, OCLC
Eric Childress, OCLC
Sharon Bosarge, OCLC (Recorder)

Day 2 Attendees:
Karen Boehning, WILS (Facilitator)
Gregg Silvis, PALINET
Lynne Siemers, OCLC CAPCON
Deb Carver, OCLC Western
John Ulmschneider, SOLINET
Frank Wojcik, NYLINK
Shirley Baker, MLNC
Jennifer Younger, INCOLSA
Eleanor Frierson, FEDLINK
Rosalind Hattingh, SABINET Online
Eric Childress, OCLC
Bob Bolander, OCLC
Scott Shultz, OCLC
Joan Mitchell, OCLC
Jay Jordan, OCLC
Sharon Bosarge, OCLC (Recorder)
Diane Vizine-Goetz of OCLC Research provided presentations on the DeweyBrowser and Curiouser.
The DeweyBrowser is a research prototype that supports searching and browsing collections of resources
organized by Dewey. The prototype was developed to make the most of the DDC numbers already assigned to
resources in large collections such as WorldCat.
The interface presents search results at three levels corresponding to the three main summaries of Dewey.
To use the DeweyBrowser, a user navigates up and down the Dewey hierarchy by clicking on a category
or enters a search term. The categories are color-coded to indicate where matching records occur. Red,
orange, and yellow (warm colors) indicate the greatest number of records. Green and blue (cool colors)
are used for categories with fewer records. White is used for categories with no matching records.
Summaries can be displayed in English, French, German, Spanish, or Swedish. Search and browse results
can also be limited to resources written in a particular language.
The DeweyBrowser has been deployed over three collections of resources:
• eBooks - 210,000+ electronic books
• WorldCat – 2.2 million of the most widely held WorldCat records
• Dewey Abridged – selected data from the Abridged Edition 14 of DDC
The WorldCat records and ebooks collections are linked to the “Find in a Library” web service (Open
WorldCat). The DeweyBrowser interface uses AJAX (Asynchronous JavaScript and XML), a set of web
programming techniques that allows user interaction with a web page without refreshing the whole
screen. AJAX speeds up the interface by requesting only parts of a page instead of the entire page.
Refreshing only the part of the screen that changes tends to encourage exploration. This type of browsing
behavior is central to how the DeweyBrowser was designed to be used. The AJAX technique is
sometimes called dynamic HTML and is being used on many types of web pages. The DeweyBrowser is
an example of an entire application built using AJAX.
The group asked Diane to explain the genesis of this project. Diane said that the idea was to experiment
with using the Dewey structure to browse large collections. The group felt that the DeweyBrowser was a
good way to simulate “shelf browsing” of physical materials for electronic resources. The group also
wondered if something similar could be developed for LC classification. The group would like to know
what OCLC’s plans are for making the DeweyBrowser available to libraries to use against their local
catalogs to provide virtual browsing. The group also wondered where else this research might end up.
Curiouser is an approach to making the best use of data about items in WorldCat and a user interface for
exploring and selecting works and items. The prototype interface . . .
• Employs the OCLC FRBR work-set algorithm
• Exploits structured data in bibliographic, authority, and holdings records
• Integrates techniques from FictionFinder for display and navigation of records in a FRBR context
• Explores Web services and other data sources to enhance the utility of Open WorldCat
The following links provide additional information about the DeweyBrowser and Curiouser.
Brian Lavoie of OCLC Research reported results from an OCLC analysis of the G5 project.
There has been much interest in the Google Print for Libraries (G5) project, which aims to digitize the
print book holdings of Harvard, Michigan, Oxford, NYPL, and Stanford. But there has been little
discussion of Google Print for Libraries as an aggregate collection. To address this gap, Brian Lavoie,
Lynn Connaway, and Lorcan Dempsey (OCLC Research) recently published “Anatomy of Aggregate
Collections: The Example of Google Print for Libraries” (D-Lib, September 2005).
The 55 million records in WorldCat (as of January 2005) can be filtered down to 32 million records
describing print books. Of those 32 million print books, the G5 libraries hold about 10.5 million (33
percent). Analysis of the holdings overlap of the 10.5 million books in the G5 aggregate collection
suggests that there is a potential redundancy rate of 40% associated with the digitization effort. However,
it was also noted that about 60% of the books in the G5 aggregate collection were held uniquely by one
G5 library. About half the books in the G5 collection were English-language materials, while the rest
were spread over more than 430 different languages. More than 80% of the G5 collection is still in
copyright. The 10.5 million books in the G5 collection can be rolled up into about 9 million distinct
works, compared to about 26 million in the system-wide print book collection.
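The redundancy figure above can be made concrete with a small sketch. Redundancy here is read as the share of distinct books held by more than one G5 library; the tiny data set below is invented purely to illustrate the calculation, not drawn from the study.

```python
# Back-of-the-envelope illustration of the overlap analysis described
# above. holdings_by_book maps a (hypothetical) book id to the number of
# G5 libraries holding it; the sample data is invented for illustration.

def redundancy_rate(holdings_by_book):
    """Share of distinct books held by more than one library."""
    counts = list(holdings_by_book.values())
    duplicated = sum(1 for n in counts if n > 1)
    return duplicated / len(counts)

# Invented example: 5 distinct books, 2 held by multiple G5 libraries.
sample = {"b1": 3, "b2": 1, "b3": 1, "b4": 2, "b5": 1}
rate = redundancy_rate(sample)   # 0.4 for this toy data
unique_share = 1 - rate          # 0.6: books held by exactly one library
```

At WorldCat scale the same calculation runs over millions of holdings records, but the redundant/unique split reported in the article (roughly 40 percent versus 60 percent) is exactly this kind of ratio.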
The researchers also conducted some speculative analysis looking at two questions:
• What results would have been obtained if a different group of libraries had been selected?
• What extensions to coverage can be obtained by adding additional collections to the original G5?
A new data set was created by choosing 5 new libraries: 1) small US liberal arts college; 2) large US
public university; 3) large US private university; 4) large US metropolitan public library; 5) large
Canadian university. These 5 new collections yielded 5.9 million unique print books from about 8 million
total holdings, and covered about 18 percent of the system-wide print book collection – significantly less
than the original G5 collection. However, only 26% of the holdings of the 5 new collections were
redundant, compared to more than 40 percent for the G5 libraries. Further analysis indicated that the print
book collection of the metropolitan public library was the most dissimilar to what was in the G5
collection, while that of the liberal arts college was most similar. Combining the 5 new collections with
the G5 collection yielded a new aggregate collection of 12.3 million books, a 17% increase over the
original G5. It is clear that diminishing returns can set in quickly with mass digitization programs.
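The diminishing-returns observation follows directly from the reported figures. The sketch below simply redoes that arithmetic (all quantities in millions of distinct print books, taken from the article); the "overlap" derivation is an inference from those figures, not a number stated in the study.

```python
# Arithmetic behind the "diminishing returns" observation above, using
# the figures reported in the D-Lib article (millions of distinct books).

g5_books = 10.5          # distinct print books in the original G5 collection
combined_books = 12.3    # distinct books after adding the 5 new collections
new_libs_books = 5.9     # distinct books held by the 5 new collections alone

added = combined_books - g5_books       # ~1.8 million genuinely new books
increase = added / g5_books             # ~0.17, the reported 17% increase

# Although the 5 new collections held 5.9 million distinct titles, only
# ~1.8 million were new to the aggregate: roughly 70% of their titles
# were already covered by the original G5 libraries.
overlap_with_g5 = (new_libs_books - added) / new_libs_books   # ~0.69
```

Each additional collection contributes mostly titles the aggregate already has, which is why coverage grows far more slowly than total holdings.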
Mass digitization programs and other aggregate collections are increasingly common. Effective decision-
making and planning can be aided by convergence on a set of standard questions to help map out the
anatomy of aggregate collections. Some questions might include:
• What are the characteristics of the overarching population of materials that is the target of the digitization effort?
• How much of that population will the digitization effort cover?
• What is the potential degree of redundancy?
• Which bibliographic unit is the focus of digitization (e.g., manifestations, expressions, works)?
• What number of participants and combination of institution types is optimal for obtaining maximum
benefit at minimum cost?
WorldCat is a strategic resource for answering these kinds of questions. OCLC Group Services
(http://www.oclc.org/groupservices/), the OCLC WorldCat Collection Analysis service
(http://www.oclc.org/collectionanalysis/), and OCLC Research data-mining activities
(http://www.oclc.org/research/projects/mining/) are good examples of how WorldCat can be used to help
analyze and manage aggregate collections.
Eric Childress of OCLC Research provided a presentation on automatic cataloging and classification.
At the last meeting of this interest group during the May Members Council meeting, the group asked for
some additional information and discussion about automated cataloging and classification. This
presentation is a response to that request.
The key question is whether machines can be leveraged to semi-automatically or automatically produce
acceptable baseline (or enriched) metadata. The answer is yes, but with some caveats. Two approaches to
automating metadata generation are harvesting (drawing from metadata in one or more sources) and
extraction (drawing from attributes of the resource and/or content in the resource), and these are often
used in tandem. Harvesting and extraction can also be integrated with other tactics including optimizing
human input (e.g., prompting humans to decide between probable values).
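The harvesting-and-extraction tandem described above can be sketched in a few lines. The field names and heuristics below are hypothetical stand-ins; real tools use far more sophisticated extraction (language detection, entity recognition, and so on).

```python
# Minimal sketch of the two metadata-generation approaches described
# above, used in tandem: harvesting (reusing metadata that already exists
# in a source record) and extraction (deriving metadata from the resource
# content itself). All field names and heuristics are hypothetical.

def harvest(source_record):
    """Harvesting: draw metadata already present in a source record."""
    return {k: source_record[k] for k in ("title", "creator")
            if k in source_record}

def extract(text):
    """Extraction: derive metadata from the resource's own content."""
    words = text.split()
    return {"word_count": len(words),
            # Crude language guess -- a stand-in for real detection.
            "language": "en" if "the" in (w.lower() for w in words)
                        else "unknown"}

def generate_metadata(source_record, text):
    """Use both tactics in tandem; harvested values take precedence."""
    record = extract(text)
    record.update(harvest(source_record))
    return record

meta = generate_metadata({"title": "Sample Report", "creator": "Doe, J."},
                         "The quick brown fox jumps over the lazy dog.")
```

A production pipeline would add the third tactic mentioned above, prompting a human to choose between probable values when the automated passes disagree.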
Tools are available from many sources, and may be specialized in task and/or domain (e.g., medical
documents), integrated (e.g., in Digital Asset Management systems) or standalone. Some frequently-
encountered features are:
• [Simple]: document statistics, file type
• [Complex]: language detection, audience level, topics, entities represented, document structure
The LC (Library of Congress) BEAT (Bibliographic Enrichment Advisory Team) activities are of interest:
• MARC records from harvesting - E-CIP and Publications in series metadata automation.
• Enrichment projects including TOCs: E-CIP, ONIX, dTOC project and work with bibliographies
Also of interest are NSDL (National Science Digital Library) projects. The MetaExtract project -- a
collaboration of CNLP (Syracuse U) and SIS (Syracuse U) -- automatically generates metadata for
course-oriented materials. The Lenny project, undertaken by the Cornell NSDL group and INFOMINE,
encompasses a suite of software-orchestrated activities, including OAI harvesting with metadata
augmentation using iVia and third-party services, to provide metadata enhancements to metadata
destined for a central repository.
Findings from the MetaExtract study show that automatically generated metadata is comparable to
manually assigned metadata in retrieval performance and quality for most browsing elements. The findings
also suggest that automatically generated metadata may be better for enabling fielded searching and
browsing.
Other projects of interest include:
• AMeGA (Automatic Metadata Generation Applications Project) at the University of North
Carolina at Chapel Hill SILS Metadata Research Center
• iVia software developed by INFOMINE and in use by the National Science Digital Library and
other digital library projects
• Automatic Exposure is an RLG-led initiative that advocates capturing standard technical metadata
about digital images automatically, as part of image creation
Current activities at OCLC include OCLC Research projects on automatic classification, FRBR-related
work (a "best data" work record derived from many manifestation records), and SchemaTrans (a technique
for translating between metadata formats). OCLC production services employing automated metadata-
creation techniques include the Digital Archive, the WorldCat link, and Connexion.
The following links contain additional information about automatic cataloging and classification projects
in OCLC Research.
• Automatic classification projects - http://www.oclc.org/research/projects/auto_class/
• OCLC ResearchWorks - http://www.oclc.org/research/researchworks/
The group also discussed the need for OCLC and the library community to keep up with developments in
the field of medical informatics. Jay told the group that he has had some preliminary conversations in this
area with researchers at the Ohio State University Medical Center. Jay also said that Betsey Wilson had
recently organized a “Digital Futures Alliance” meeting that included representatives from Boeing,
Corbis, a large medical services provider, and others in areas that must manage large amounts of complex
data.
Recommended Agenda Items for Next Meeting:
• Further information on developments in the area of automated cataloging and classification
These minutes, the Key Issues Report, and all presentations will be available on the Research and New
Technologies Interest Group meeting Web site, available from: