Group Name: Research and New Technologies Interest Group
Meeting Minutes/Summary
October 2005 Members Council Meeting

Group Name: Research and New Technologies Interest Group
Prepared and submitted by: Sharon Bosarge
Have the minutes been reviewed by the group chair? Yes X No

Attending:

Day 1 Attendees
• Karen Boehning, WILS (Facilitator)
• Lynne Siemers, OCLC CAPCON
• Jennifer Younger, INCOLSA
• Migell Acosta, OCLC Western
• Gregg Silvis, PALINET
• Deb Carver, OCLC Western
• John Ulmschneider, SOLINET
• Eleanor Frierson, FEDLINK
• Shirley Baker, MLNC
• Lorcan Dempsey, OCLC
• Diane Vizine-Goetz, OCLC
• Brian Lavoie, OCLC
• Bob Bolander, OCLC
• Joan Mitchell, OCLC
• Eric Childress, OCLC
• Sharon Bosarge, OCLC (Recorder)

Day 2 Attendees
• Karen Boehning, WILS (Facilitator)
• Gregg Silvis, PALINET
• Lynne Siemers, OCLC CAPCON
• Deb Carver, OCLC Western
• John Ulmschneider, SOLINET
• Frank Wojcik, NYLINK
• Shirley Baker, MLNC
• Jennifer Younger, INCOLSA
• Eleanor Frierson, FEDLINK
• Rosalind Hattingh, SABINET Online
• Eric Childress, OCLC
• Bob Bolander, OCLC
• Scott Shultz, OCLC
• Joan Mitchell, OCLC
• Jay Jordan, OCLC
• Sharon Bosarge, OCLC (Recorder)

Day 1

Diane Vizine-Goetz of OCLC Research provided a presentation on the DeweyBrowser and Curiouser

The DeweyBrowser is a research prototype that supports searching and browsing collections of resources organized by Dewey. The prototype was developed to make the most of DDC numbers assigned to library materials and to explore the use of AJAX (Asynchronous JavaScript and XML) technology in a browser interface.

The interface presents search results at three levels corresponding to the three main summaries of Dewey. To use the DeweyBrowser, a user navigates up and down the Dewey hierarchy by clicking on a category, or enters a search term. The categories are color-coded to indicate where matching records occur. Red, orange, and yellow (warm colors) indicate the greatest number of records. Green and blue (cool colors) are used for categories with fewer records. White is used for categories with no matching records.

Summaries can be displayed in English, French, German, Spanish, or Swedish. Search and browse results can also be limited to resources written in a particular language.

The DeweyBrowser has been deployed over three collections of resources:
• eBooks – 210,000+ electronic books
• WorldCat – 2.2 million of the most widely held WorldCat records
• Dewey Abridged – selected data from the Abridged Edition 14 of DDC

The WorldCat records and eBooks collections are linked to the “Find in a Library” web service (Open WorldCat).

AJAX is an approach to programming web interfaces that allows user interaction with a web page without refreshing the whole screen. It speeds up the interface by requesting only parts of a page instead of the entire page. Refreshing only the part of the screen that changes tends to encourage exploration, and this type of browsing behavior is central to how the DeweyBrowser was designed to be used. The AJAX technique is sometimes called dynamic HTML and is being used on many types of web pages. The DeweyBrowser is an example of an entire application built using AJAX.

The group asked Diane to explain the genesis of this project. Diane said that the idea was to experiment with using the Dewey structure to browse large collections. The group felt that the DeweyBrowser was a good way to simulate, for electronic resources, the “shelf browsing” of physical materials. The group also wondered if something similar could be developed for LC classification. The group would like to know what OCLC’s plans are for making the DeweyBrowser available to libraries to use against their local catalogs to provide virtual browsing. The group also wondered what other practical applications this research might lead to.

Curiouser is an approach to making the best use of data about items in WorldCat and a user interface for exploring and selecting works and items. The prototype interface . . .
• Employs the OCLC FRBR work-set algorithm
• Exploits structured data in bibliographic, authority, and holdings records
• Integrates techniques from FictionFinder for display and navigation of records in a FRBR context
• Explores Web services and other data sources to enhance the utility of Open WorldCat

The following links provide additional information about the DeweyBrowser and Curiouser:
• ResearchWorks –
• Curiouser –
• DeweyBrowser –

Brian Lavoie of OCLC Research reported results from an OCLC analysis of the G5 project

There has been much interest in the Google Print for Libraries (G5) project, which aims to digitize the print book holdings of Harvard, Michigan, Oxford, NYPL, and Stanford. But there has been little discussion of Google Print for Libraries as an aggregate collection. To address this gap, Brian Lavoie, Lynn Connaway, and Lorcan Dempsey (OCLC Research) recently published “Anatomy of Aggregate Collections: The Example of Google Print for Libraries” (D-Lib, September 2005):

The 55 million records in WorldCat (as of January 2005) can be filtered down to 32 million records describing print books. Of those 32 million print books, the G5 libraries hold about 10.5 million (33 percent). Analysis of the holdings overlap of the 10.5 million books in the G5 aggregate collection suggests that there is a potential redundancy rate of 40% associated with the digitization effort. However, it was also noted that about 60% of the books in the G5 aggregate collection were held uniquely by one G5 library. About half the books in the G5 collection were English-language materials, while the rest were spread over more than 430 different languages. More than 80% of the G5 collection is still in copyright. The 10.5 million books in the G5 collection can be rolled up into about 9 million distinct works, compared to about 26 million in the system-wide print book collection.

The researchers also conducted some speculative analysis looking at two questions:
• What results would have been obtained if a different group of libraries had been selected?
• What extensions to coverage can be obtained by adding additional collections to the original G5?

A new data set was created by choosing 5 new libraries: 1) small US liberal arts college; 2) large US public university; 3) large US private university; 4) large US metropolitan public library; 5) large Canadian university. These 5 new collections yielded 5.9 million unique print books from about 8 million total holdings, and covered about 18 percent of the system-wide print book collection, significantly less than the original G5 collection. However, only 26% of the holdings of the 5 new collections were redundant, compared to more than 40 percent for the G5 libraries. Further analysis indicated that the print book collection of the metropolitan public library was the most dissimilar to what was in the G5 collection, while that of the liberal arts college was most similar. Combining the 5 new collections with the G5 collection yielded a new aggregate collection of 12.3 million books, a 17% increase over the original G5. It is clear that diminishing returns can set in quickly with mass digitization programs.
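
The percentages quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the rounded figures given in these minutes (in millions of print books), the variable and function names are invented for the example, and the last value (the share of the new libraries' books not already in G5) is derived here on the assumption that the quoted counts are directly comparable; it is not a figure reported in the presentation.

```python
# Rounded figures from the minutes, in millions of print books.
# All names below are illustrative, not taken from the OCLC study.
SYSTEM_PRINT_BOOKS = 32.0   # print-book records in WorldCat (January 2005)
G5_BOOKS = 10.5             # distinct books held by the G5 libraries
NEW5_HOLDINGS = 8.0         # total holdings of the 5 alternative libraries
NEW5_UNIQUE_BOOKS = 5.9     # distinct books among those holdings
COMBINED_BOOKS = 12.3       # G5 plus the 5 alternative libraries


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Coverage of the system-wide print book collection.
g5_coverage = percent(G5_BOOKS, SYSTEM_PRINT_BOOKS)
new5_coverage = percent(NEW5_UNIQUE_BOOKS, SYSTEM_PRINT_BOOKS)

# Redundancy: holdings that duplicate a book already counted in the same group.
new5_redundancy = percent(NEW5_HOLDINGS - NEW5_UNIQUE_BOOKS, NEW5_HOLDINGS)

# Growth of the aggregate collection when the new libraries are added to G5.
added_books = COMBINED_BOOKS - G5_BOOKS
growth_over_g5 = percent(added_books, G5_BOOKS)

# Derived here (assumption): share of the new libraries' books not already in G5.
new_to_g5 = percent(added_books, NEW5_UNIQUE_BOOKS)

print(f"G5 coverage of print books:       {g5_coverage}%")
print(f"New-5 coverage of print books:    {new5_coverage}%")
print(f"New-5 internal redundancy:        {new5_redundancy}%")
print(f"Growth from adding New-5 to G5:   {growth_over_g5}%")
print(f"New-5 books not already in G5:    {new_to_g5}%")
```

On these rounded inputs, only about 1.8 million of the 5.9 million books in the five additional collections are new to the aggregate, which is one way to see the diminishing returns noted above.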

Mass digitization programs and other aggregate collections are increasingly common. Effective decision-making and planning can be aided by convergence on a set of standard questions to help map out the anatomy of aggregate collections. Some questions might include:
• What are the characteristics of the overarching population of materials that is the target of the digitization effort?
• How much of the population will the digitization effort cover?
• What is the potential degree of redundancy?
• What bibliographic unit is the focus of digitization (e.g., manifestations, expressions, works)?
• What number of participants and combination of institution types is optimal for obtaining maximum benefit with minimum cost?

WorldCat is a strategic resource for answering these kinds of questions. OCLC Group Services, the OCLC WorldCat Collection Analysis Service (collectionanalysis/), and OCLC Research data-mining activities are good examples of how WorldCat can be used to help analyze and manage aggregate collections.

Day 2

Eric Childress of OCLC Research provided a presentation on automatic cataloging and classification

At the last meeting of this interest group, during the May Members Council meeting, the group asked for some additional information and discussion about automated cataloging and classification. This presentation is a response to that request.

The key question is whether machines can be leveraged to semi-automatically or automatically produce acceptable baseline (or enriched) metadata. The answer is yes, but with some caveats.

Two approaches to automating metadata generation are harvesting (drawing from metadata in one or more sources) and extraction (drawing from attributes of the resource and/or content in the resource), and these are often used in tandem. Harvesting and extraction can also be integrated with other tactics, including optimizing human input (e.g., prompting humans to decide between probable values).

Tools are available from many sources, and may be specialized in task and/or domain (e.g., medical documents), integrated (e.g., in Digital Asset Management systems), or standalone. Some frequently encountered features are:
• [Simple]: document statistics, file type
• [Complex]: language detection, audience level, topics, entities represented, document structure, taxonomy derivation

The LC (Library of Congress) BEAT (Bibliographic Enrichment Advisory Team) activities are of interest, including:
• MARC records from harvesting: E-CIP and Publications in series metadata automation
• Enrichment projects including TOCs: E-CIP, ONIX, dTOC project, and work with bibliographies and pathfinders

Also of interest are NSDL (National Science Digital Library) projects. The MetaExtract project, a collaboration of CNLP (Syracuse U) and SIS (Syracuse U), aims to automatically generate metadata for course-oriented materials. The Lenny project is being undertaken by the Cornell NSDL group and INFOMINE. It encompasses a suite of software-orchestrated activities, including OAI harvesting with metadata augmentation using iVia and third-party services, to enhance metadata destined for a central repository.

Findings from the MetaExtract study show that automatically generated metadata is comparable to manually assigned metadata in retrieval performance and quality for most browsing elements.
The findings also show that automatically generated metadata may be better for enabling fielded searching and browsing results.

Other projects of interest include:
• AMeGA (Automatic Metadata Generation Applications Project) at the University of North Carolina at Chapel Hill SILS Metadata Research Center
• iVia, software developed by INFOMINE and in use by the National Science Digital Library and other digital library projects
• Automatic Exposure, an RLG-led initiative that advocates capturing standard technical metadata about digital images automatically, as part of image creation

Current activities at OCLC include OCLC Research projects on automatic classification, FRBR-related work (a "best data" work record derived from many manifestation records), and SchemaTrans (a technique for translating between metadata formats). OCLC production services employing metadata creation automation techniques are the Digital Archive, the WorldCat link, and Connexion.

The following links contain additional information about automatic cataloging and classification projects in OCLC Research:
• Automatic classification projects -
• OCLC ResearchWorks -

The group also discussed the need for OCLC and the library community to keep up with developments in the field of medical informatics. Jay Jordan told the group that he has had some preliminary conversations in this area with researchers at the Ohio State University Medical Center. Jay also said that Betsey Wilson had recently organized a “Digital Futures Alliance” meeting that included representatives from Boeing, Corbis, a large medical services provider, and others in areas that must manage large amounts of complex information.

Recommended Agenda Items for Next Meeting:
• Further information on developments in the area of automated cataloging and classification

Meeting-related Resources

These minutes, the Key Issues Report, and all presentations will be available on the Research and New Technologies Interest Group meeting Web site, available from: