Cushman Exposed! Exploiting Controlled Vocabularies to Enhance Browsing and Searching of an Online Photograph Collection
Upcoming SlideShare
Loading in...5
×
 

Cushman Exposed! Exploiting Controlled Vocabularies to Enhance Browsing and Searching of an Online Photograph Collection

on

  • 294 views

Dalmau, Michelle and Jenn Riley. "Cushman Exposed! Exploiting Controlled Vocabularies to Enhance Browsing and Searching of an Online Photograph Collection." Digital Library Program Brown Bag ...

Dalmau, Michelle and Jenn Riley. "Cushman Exposed! Exploiting Controlled Vocabularies to Enhance Browsing and Searching of an Online Photograph Collection." Digital Library Program Brown Bag Presentation, May 17, 2004.

Statistics

Views

Total Views
294
Views on SlideShare
291
Embed Views
3

Actions

Likes
0
Downloads
0
Comments
0

1 Embed 3

http://www.slideee.com 3

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Our presentation today will cover: <br /> An introduction, which includes an overview of the Cushman collection and what we learned from previous projects. <br /> The metadata section, which will cover usage of metadata for image collections in general as well as how the Cushman Collection utilizes metadata. <br /> The research overview section, which will include a literature review of existing browse and search research that complements the Cushman project. <br /> Then we will present an overview of the usability findings and how they, along with existing research, led to our browse and search specifications. <br /> We wind down by discussing implementation issues, focusing on how we integrated a controlled vocabulary with our Cushman search engine. <br /> And lastly, we will discuss what we’ve learned, what we’d do over and what we love about Cushman. <br /> We will provide demos along the way to make things interesting. And at any time, feel free to ask questions. <br />
  • We were awarded an IMLS grant to make available online the Charles W. Cushman collection of 14,500 photographs shot between 1938-1969. <br /> Cushman, an amateur photographer, was an early adopter of color slide film while many of his contemporaries were still shooting in B/W. <br /> Cushman traveled extensively, and his photos capture “slices of life” – of people, nature, places, and events – which yield visual documentaries of the times. <br /> [demo] <br /> Home page – timeline – highlights – <br /> Other aspects of the website such as essays about Cushman and his photos. . . <br />
  • Before we begin discussing implementation details of the Cushman collection, it is important to mention one of its predecessors, the U.S. Steel Gary Works Photograph Collection. <br /> It contains approximately 2,200 images most of which are accompanied by archival descriptions. A controlled vocabulary was used to further describe the subject matter of the images. The search engine only looks at the actual headings assigned to the images, and does not use the structure of the controlled vocabulary to assist searching. <br /> So if a user types “doctors” into the search box she won’t get results while “physicians” yields six results. In order to effectively search the collection, the user must reference an alphabetical listing of all subject headings used [show image]. The user is expected to look through the entire list of subject headings for the one that most closely matches their search term. <br /> Usability testing revealed that this approach was difficult for users. They often did not reference the alphabetical list at all, and when they did, generally they would only look for one term and give up if they did not find it. <br /> Obviously, this approach is not one we wanted to duplicate for the Cushman collection. Instead, we needed methods to integrate the controlled vocabulary’s structure into the browse and search functionality, taking the burden off of the user. <br />
  • But, obviously, we can&apos;t abandon free-text descriptions completely in favor of controlled vocabularies. One advantage to “free-text” descriptions is that they keep, in this case, the photographer’s original description. Cushman’s personality and focus are preserved in his actual words. However, his descriptions are not consistent in many ways. They have factual errors, abbreviations, and differing levels of completeness. The original descriptions are by themselves not enough to provide high-quality search and browse access to the collection. <br /> The addition of terms from controlled vocabularies is intended to overcome these disadvantages of using free text descriptions alone. Sometimes these descriptions are not accurate or clear, and the controlled vocabulary serves to document images more consistently. We&apos;re librarians, so we know the value of controlled vocabularies as access points for resources. The use of controlled vocabularies has a long history rooted in solid reasoning. We know how they add additional access points, collocate related records, disambiguate different meanings of a word, and, when the same CV is used for multiple collections, cross-collection searching is much improved. <br /> We assume that our analytical cataloging efforts are increasing access for our users but average users do not understand controlled vocabularies much less know how to use them, and our current IR systems rarely do anything to help them. <br /> So, then, can we design a system that utilizes the structure of a controlled vocabulary to allow us to catalog according to time-tested principles yet increase access for our users? First we have to learn a little bit about what these CVs are and how they are structured. <br />
  • Mr. Cushman avidly documented every shot in a series of notebooks. [show image] and on his slide mounts [show image] He would often record exposure and f-stop settings from the camera, dates, locations, and names of people in the photographs. Despite the overabundance of descriptive information that came with the collection, we felt it was important to use additional metadata to enhance geographic, genre, and subject access. <br />
  • &lt;http://www.getty.edu/research/conducting_research/vocabularies/tgn/&gt; <br /> Click “Browse the TGN hierarchies” <br /> Point out Ottoman Empire (historical place name) <br /> Icons show what has hierarchy below <br /> Click hierarchy for North & Central America <br /> Click United States link for info on this place <br /> Click hierarchy beside US for stuff underneath <br /> Point out not just states – lots of canals, reservoirs, etc. Good research use. <br /> Click New Search at top <br /> Search for Springfield. Show as research tool – what state/county is Springfield in? <br />
  • http://www.loc.gov/rr/print/tgm2/ <br /> Browse photographs <br /> Click “Cityscape photographs” – hierarchically underneath both Cityscapes and Photographs <br /> Back up to photographs – show how terms aren’t for “fine arts” usage, but more general usage. <br /> Mention &quot;glamour photographs&quot; <br />
  • http://www.loc.gov/rr/print/tgm1/ <br /> Search “dogs” <br /> Note UF, NT, BT, and RT <br /> Look at the NT list – we’ll come back to this <br /> Scroll down to show other terms <br /> Terms like this for what an image is OF <br /> Search &quot;democracy&quot; <br /> Terms like this for what an image is ABOUT – we don&apos;t use these much. City & town life might be closest example <br />
  • TGM I strengths include pre-defined relationships between concepts. Terms can be marked as broader, narrower, or related to another term. TGM I also provides some lead-in vocabulary, linking words that users and catalogers might think of to their synonyms that TGM prefers. This structure of relationships linking terms to one another can potentially be used both when describing and when browsing and searching collections, although they’ve historically been much more heavily used for the former. <br /> TGM I Weaknesses include lacking relationships for new terms. Recently, the word “courtesans” was added to TGM I, and only assigned two links with other terms: related to “prostitution” and “relations between the sexes.” It was not associated with the broader term “people”. (We hope this doesn&apos;t mean that LC doesn&apos;t think courtesans are people.) We find that many of the newer terms tend to integrate poorly with the existing structure. <br /> TGM I language is very formal, and thus can be user-unfriendly. The average person would not necessarily describe communes as “collective settlements.” This sort of formalization is common in controlled vocabularies, but users have a difficult time thinking of these formal terms on their own. <br /> Although lead-in vocabulary is integrated in the TGM I structure, there is not enough of it. Some omissions are consistent, for instance phrases such as “nails & spikes” are not referenced from each individual term in the compound heading. Also, singular forms of words are not part of the lead-in vocabulary structure. In addition to these consistent problems, general synonyms that users would think of are often missing as well. <br /> Lastly, there are too many top-level categories to be useful as initial browsing access points for end-users. Of the 6300 total terms in TGM I, over 1300 do not have a broader term. A good browsing hierarchy would have no more than 20 top-level terms. Yahoo!, for example, has 14 top-level categories. Also, the language used for many of these top-level subjects simply baffles users. Would you know to find “clouds” under natural phenomena or “bridges” under transportation facilities? <br />
  • The following slides will cover a brief review of the significant literature addressing how controlled vocabularies are used to assist in accessing content. <br /> Raya Fidel is known for early 1990’s studies in which she investigated overlap between a user’s free-text terminology and thesauri descriptors and frequency of thesauri use. In a nutshell, she learned that there was a considerable overlap between CV terms and free-text terms. She also learned that users would often consult a thesaurus in order to improve their search results. She concludes by stating that more research is necessary to understand how to integrate “mapping” or “switching” languages to standardize the search across resources while supporting the ability to search both free-text and descriptors at once. <br /> Choi and Rasmussen recently looked at how a) users query for images and b) how they <br /> determine relevancy. Users overwhelmingly approach images in a very textual way – search by artist, title, date, but topic was the most important discovery criteria. Users also determine relevance, especially topical relevance, by looking at the images themselves over the metadata. Choi and Rasmussen also mapped the user’s free-text queries to various controlled vocabularies and concluded that users would benefit from more in-depth indexing with robust Controlled Vocabularies that support various relationships and have good synonym mapping to increase access. <br /> Jane Greenberg is a contemporary advocate for the integration of controlled vocabularies structures with retrieval systems. She introduced the concept of optimal query expansion processing methods (which simply put means: she’s made explicit how to use the syndetic relationships -- BT, NTs, USED FOR, etc. -- to either create automatic mappings to the CV or to create user-initiated options). <br /> She conducted a study where she observed real users, querying in real time several databases that utilize CVs as a separate search feature (such as CSA). She took Fidel’s study a step further by not only mapping the free-text search terms with the CV terms, but she then explored with users only the CV terms that matched their original topical queries. Users selected controlled terms that they felt matched their original query. Green berg then mapped these terms to the syndetic structure, which led to her suggestions regarding automatic and user-initiated retrieval. <br /> In the late 1990’s, Getty experimented with two image retrieval applications that utilized Getty CVs such as AAT and TGN to describe the contents (a.k.a. which morphed into ARThur). Getty researchers felt that access to related concepts (related terms and synonyms) should be reconciled in the search interface for users. They also explored ways to suggest terms based on how a search was initiated. Problems they faced with these experimental systems concerned performance impact (so they axed the automatic retrieval of all NTs or children records) as well as reconciling how the various CVs treats the same term thus causing a confusing result set (e.g. painter in AAT, painter in Union List of Artist Names as artist name and painter in TGN as place). <br />
  • David Bawden asserts that browsing is purposive –even if users do not know exactly what they are looking for, they do have goals or information needs to meet when browsing. Typical browsing objectives include: finding similar items, using predefined categorizations or hierarchies to find content, and obtaining an overview of the contents within a particular resource. <br /> Choi and Rasmussen as part of their 2002 and 2003 studies with LC’s American Memory found that browsing is a significant part of image discovery – especially topical browsing. <br /> Quite a few research-oriented digital image resources have been created to complement the studies of existing image collections. The two of particular importance are University of Michigan’s SI Art Image Browser and the more recent Flamenco browser designed by Berkeley. <br /> SI Art Image Browser uses the AAT to provide broad categories to assist the user in browsing and searching (facets). Of particular interest is how they provided a hierarchical browse feature based on the AAT’s structure. So users can explore the database by increasingly refining or expanding their browse options using this feature. <br /> Flamenco is a contemporary research project that provides a hybrid browse/search interface to allow users to explore a large domain without “feeling lost”. <br /> Much like the Michigan project, the Fine Arts Interface uses AAT as a way to provide a hierarchical and faceted navigation structure for discovery. Users can combine facets (artist and medium), can explore images hierarchically (country then city) or do both. <br /> Every link selected generates new results so you get instant (visual) feedback. It’s been affirmed through various user studies, that the Flamenco interface provides the ability to easily explore an online collection by refining or broadening a browse or search. The combination of browsing and searching along with relevant suggestions also supports both directed (expert) and non-directed (novice) exploration of the resource. <br />
  • Since January 2003 several usability studies were held. Most involved the evaluation of an evolving prototype. Linked here are excerpts from the various prototypes we mocked up for user feedback. They are based on what we learned from similar research projects. <br /> [Group Walkthrough] <br /> In the group walkthrough session, participants worked individually with tasks based on paper versions of the prototype and then as a group we explored obstacles in solving the tasks, which led to design alternatives based on the groups’ suggestions. After, a guided walkthrough of a basic Cushman prototype web site was conducted for further feedback. Other resources frequently used by the participants were also evaluated. The tasks were not assessed in terms of Pass/Fail, but instead served as ways to explore the preliminary design. <br /> The 4 participants in this session were mostly librarian types. One librarian and 2 SLIS students (1 with a BFA in photography). The 2 SLIS students were working with photograph collections at the Lily Library. The final participant is a professional photographer, but has worked in the library as well. <br /> [Individual Walkthrough] <br /> As a follow-up to the group walkthrough a similar session was conducted with faculty in the humanities who use images in their teaching or research. We visited two faculty members, one from Art History and the other from History so they could show us how they use images in their teaching, what sources they draw from, their process for discovery, etc. as well as have them give us feedback on the Cushman prototype in light of their experiences. <br /> [Task Scenarios] <br /> The task scenario study was designed to test browse and search functionality that emerged based on the previous two studies. This time, the evolving prototype was enhanced to simulate functionality, especially CV integration. <br /> The purpose of this evaluation was to test a) if the browse and search suggestions made sense; b) if users would use the suggestions – tasks were not necessarily geared toward making the participants use the suggestions; and c) if participants could successfully browse and search the collection via the typical means without getting side-tracked by the suggestions. A secondary focus was to test to see if the notebook browse made sense. <br /> We met with 12 participants, staff, students and faculty, in their normal place of work or study, when ever possible, and asked them to work through 14 tasks. The participants were image users – they either worked with images on a daily basis or relied on images for teaching and learning. <br /> Most of the browse and search tasks (aside from the notebook related tasks) were successfully accomplished. <br /> I will now summarize specific usability findings over the next two slides that resulted from these sessions. <br />
  • Our usability studies supported the need for utilizing the structure of the controlled vocabulary to aide the user in searching. During the walkthroughs other resources used by patrons such as Vivisimo and Teoma search engines were explored due to their ability to cluster results and provide suggestions. <br /> Nearly all participants disliked or avoided referencing a long list of subject terms in order to formulate their queries. Several users claimed that they would simply use a “hit or miss” approach to searching our system, rather than wade through a long list of subjects looking for one that interested them. <br /> Our users were also concerned about word choice. A few participants revealed uncertainty about what terms to use to ensure hits. Several of the task findings from the walkthroughs reveal that participants did not know where to use US, USA or America to generate hits. The group walkthrough generated many questions about cross-referencing and controlled vocabulary use. If you recall, this group was composed of mostly librarians so a lively conversation ensued. They drew from their experiences as intermediaries to affirm that an integrated controlled vocabulary would help users broaden and narrow their searches. <br /> Users often iteratively modify queries based on the results they retrieve. The ability to reformulate a query in context with relevant suggestions from the system was strongly welcomed by users in our studies. Many felt that these “system” suggestions both revealed more about the subject matter at hand and provided helpful clues for formulating better queries. <br /> During the individual walkthrough, the faculty member reviewed his experiences with using resources that provide suggestion but remained uncertain as to how his query yielded the suggestions. We looked at the Cushman prototype – in particular this screen – and he essentially poo-pooed it. He suggested offering context such as: “Your term ‘street life’ generated these broader categories for searching.” <br /> The other faculty member felt that a topical search should result in all images that are part of that category even if she didn’t use the precise words in the query to represent this topic in its entirety. This suggestion harks back to the research and how users benefit from the automatic retrieval of all narrower terms so if you type “sports” you get images of baseball, football, soccer and so on. <br /> Lastly, in support for providing suggestions in the interface, 10 users during the task-based study gravitated toward the CV-generated search suggestions right away. Most used the suggestions successfully. They used them with confidence and with little hesitation. <br /> Out of the group of 12, 2 were “searchers.” The “searchers” were not hindered by the suggestions, but they did feel they were extraneous. During debriefing, they began to explore the suggestions and claimed that they were helpful. <br />
  • Our usability studies revealed that browsing provides a glimpse into an unknown collection so it is important to reveal the contents as intuitively and fully as possible. <br /> Browsing should support the exploration of a resource by providing guidance and flexibility to change directions as is common when browsing. Many participants from the final study, the task scenarios, commented that the suggestions are well organized supporting this notion of coherence and guidance while browsing. <br /> Users want the ability to “drill-down” through the structure as they browse. <br /> We often heard that browsing “helps us get into Cushman’s head”. <br /> Our users frequently requested the ability to combine different browsing categories. <br /> The group walkthrough participants were actually the ones to flat-out suggest the ability to provide faceted browsing. They considered the way other large image collections facilitate browsing, as well as how patrons approach them for information, and they felt that combining year/subject or genre/location would be most helpful. <br /> Like the group participants, one of the faculty members suggested ways to meaningfully browse images by combining categories. Here’s a quote that reflects how he would approach research in his discipline: “I Could imagine going from Landscape Photos to Location. I look at cities and buildings. After clicking on Cityscape, I’d like to see it by place.” <br /> Lastly, the act of browsing cultivates searching. By browsing, users begin to not only hone in on what they need to find, but they learn from cues in the interface how to formulate queries for searching. During the task scenarios, users found the Browse Suggestions helpful. Many felt it was appropriate for browsing. Others felt it helped in exploring the collection: “Exposes breadth of collection” (U9). <br />
  • Based on our usability studies and research findings, we proposed specifications for searching and browsing the Cushman collection. This slide provides a summary of the significant requirements. <br /> We recommended that all lead-in terms would be mapped to the official terms in the thesaurus as retrieval of records described with all narrower terms would happen automatically. This functionality is necessary because we learned in our user studies that some TGM I authorized terms are incomprehensible so mapping lead-in terms and retrieving all narrower terms should increase access. <br /> Because we have an extensive amount of “free-text” descriptions provided by Cushman in addition to the controlled vocabulary descriptors, we recommended that an integrated search approach be adopted that sends a user’s query to search against both the descriptive fields in our database and to the TGM I thesaurus. [demo diagram] <br /> To go that extra mile to more fully expose the users to the controlled vocabulary as it relates to their original search, we also recommended that the interface support user-initiated reformulation of her query based on relevant broader or narrower terms. <br /> Browsing by year, genre, subject and location should be supported either individually or in combination. And categories with inherent hierarchical structures: subject terms (from TGM I) and location (from our database) should expose that structure to facilitate browsing. <br /> We also learned that users often sought a precise term when browsing and not necessarily a broad topic (tornado over natural disaster, for instance) so the subject browse list is comprised of cataloger-assigned terms along with TGM I lead-in terms for extra access. <br /> [Demo Cushman] <br /> search: <br /> car and shows and California (talk about term for term replacement) <br /> browse: <br /> seashores (maps to beaches) <br /> beaches (refine by Greece) <br /> remove “seashores” <br /> add Dwellings (subject hierarchy) <br /> Detail – “plant containers” <br />
  • Here is a summary of the technologies we chose to drive the Cushman collection. These are all pretty standard. The OracleText module you see is the core of using the controlled vocabularies intelligently – we’ll come back to that in just a second. Like all other DLP image collections, we provide persistent URLs to individual images and “detail” pages for images and metadata. <br />
  • The thesaurus support in Oracle Text makes possible the intelligent use of the controlled vocabulary structure described earlier. <br /> We loaded the TGM I data as a custom thesaurus into Oracle text, originally using the preferred term, broader term and narrower term relationships available in Oracle Text. A related term relationship also exists, but our user tests showed that providing access to related terms was not a priority. <br /> OracleText provides special syntax in SQL queries to use thesaural relationships to extend the power of a search. This is used for retrieving actual records when a user types in a lead-in term as a query, and is used to expand a subject term in a query to retrieve records matching either that subject or any of its narrower terms. <br /> We don&apos;t always want to retrieve list of records, however. In order to tell users what subjects are related to the subject they&apos;ve searched so they can broaden or narrow their result set, we use PL/SQL functions to query the thesaurus itself. These functions only return lists of thesaurus terms, and don&apos;t search the rest of the database for records that use them. <br />
  • Although both TGM I and the OracleText engine basically adhere to the ANSI/NISO Z39.19 Guidelines for the Construction, Format, and Management for Monolingual Thesauri American national standard, we still found some incompatibilities between them. <br /> TGM I has a few hundred cases where one term is a lead-in term for multiple preferred terms. Crops Use Farming; Use Plants is one example where the lead-in term has multiple slightly different meanings. &quot;Multiple births&quot; being a lead-in term for Twins, Triplets, Quadruplets, and Quintuplets is an example where a general term points to more specific lead-in terms. However, Oracle Text doesn&apos;t represent lead-in term mappings to preferred terms in this unidirectional way. Instead, it represents synonyms as a ring, with ONE and only one member of the ring as the preferred term. This model does not allow the multiple preferred terms of Twins, Triplets, Quadruplets, and Quintuplets, so we had to find another way to represent these relationships so that we could give our users access to all of the TGM preferred terms. We ended up changing these from Preferred Term relationships to Narrower Term Partitive relationships, and extending all subject queries to retrieve all NTPs to address this problem. <br /> For subject and keyword searching, we wanted terms entered to match subjects with that term as any part of the subject phrase, since we know users are unlikely to think up exactly the same subject phrase TGM uses. However, for browsing, since we’re displaying to users subjects from TGM, we want results retrieved from clicking on any browse link to match exactly that phrase and only that phrase. We had to take two different approaches to accomplish these two tasks that differ in one small but significant way. <br /> TGM I, like many thesauri, uses parenthetical qualifiers to distinguish different senses of the same word. Oracle Text knows that about this syntax and its standard meaning. TGM I, however, doesn’t use qualifiers in every case. For example, TGM I includes both the terms Cranes (by which they mean “hoisting machinery”) and Cranes (Birds). Oracle Text interprets Cranes without the qualifier to be a broader term than the version with the qualifier, when TGM I infers no such relationship. We had to process the terms in TGM I, removing the parentheses, before loading them into a custom thesaurus in Oracle Text in order to avoid this problem. <br /> TGM I, as a thesaurus designed primarily for human rather than machine processing, frequently contains authorized terms containing punctuation. We’ve already talked about parentheses, but they also routinely contain ampersands instead of the word AND, commas, and apostrophes. We had to carefully handle all of these cases to ensure users could enter subject terms exactly as TGM listed them, while operating within the constraints of a web environment where many of these characters have special meanings. <br />
  • Before we began metadata entry for the Cushman project, a team was created to define what descriptive and analytic metadata we would record for each slide. Despite this careful planning, we still needed to adjust the metadata model as the project progressed. No one had foreseen the need for recording multiple places for a single image (ambiguous place names like national parks, pictures from the tops of mountains, images of artworks), and when the need did become evident, catalogers developed a workaround. This is what catalogers are used to doing in the MARC environment using vended ILS systems. In the case of Cushman, however, we had designed the cataloging system ourselves, and we had the flexibility to adapt the metadata model as new needs were identified. Because workarounds for multiple places and other similar problems had been in place before the appropriate personnel learned of and addressed these issues, a great deal of clean-up work was required after primary cataloging was completed. In the future, we&apos;ll make sure that cataloging staff are more involved in the bigger-picture planning for metadata creation so problems like this can be addressed effectively as soon as they are identified. <br /> Quality control is another area where metadata creation is different from traditional cataloging. In environments we control, we can also introduce mechanisms to ensure consistency and avoid errors in metadata entry. For the Cushman project, we began with only a small amount of training and documentation for project catalogers, and cataloging was more than half done before cataloging staff had a dedicated supervisor. Consequently, the metadata being created was extremely inconsistent, with many different types of information being entered in a single field and very little inter-cataloger consistency with regard to the depth of description applied to a single image. We performed a metadata clean-up stage after primary cataloging had been completed to address these problems, but we would have saved time and money by addressing these problems sooner. <br /> We could have benefited from more rigorous user studies especially studies that specifically evaluated real-life, real-time use of image resources that do rely on controlled vocabularies to describe the contents of that collection. The usability studies we held made sure to explore such resources when they were introduced, but observation sessions of use in context were not well-planned. <br /> So obviously we learned some important practical lessons from this project. These lessons will help us complete future projects more efficiently. But we don’t now and never will have all the answers. We must continue finding new ways of delivering high-quality digital projects while increasing them in scope and complexity. <br />
  • Obviously, the Cushman photos themselves are phenomenal, and on their own worthy of a lot of attention. But the site we’ve built around the photos and the increased access we’ve provided via search and browse make a big difference. Aaron Reichert, my predecessor as the Digital Media Specialist in the DLP, told us “This is definitely the easiest and most versatile browsing and searching engine I&apos;ve ever used.” <br /> We’re also getting a lot of other kinds of evidence about the success of the site. Right now (May 7) it’s the #4 hit in Google for the search term “photograph.” It was a Yahoo! Pick of the Day on November 17, 2003, shortly after it was launched. It has an entry in the Internet Scout Report, ResearchBuzz, and the Librarians’ Index to the Internet. Weblog traffic mentioning the site has been extremely heavy (May 7 – google search on charles cushman blog gets 406 hits!). All of the blogs link to the site, most show an image or two, and many have comments by readers to the tune of “Wow! What great photos! I’ve been &lt;wherever&gt; and it &lt;looks just the same|has completely changed&gt;!” But one blog sticks out to us as indicative of the sort of impact a collection like this can have. The author of the blog BrokenType (identified only as Alex) made an entry in December 2003 entitled “A Speculative Slideshow.” The entry is a mixture of bits of background info on Cushman and the site, a detailed envisioning of the actions and conversation surrounding the shooting of one of the many photos of women in bathing suits on the site (She brushed her dress modestly, laid her black purse on the rock to anchor the magazine, and stood up. “Is this ok?” She had a German accent, he noted, adjusting the lens. “Yes, that’s perfect.”), and speculation about Charles’ relationship with his first wife, Jean (His wife, Jean, traveled with him frequently, and appears in the collection like a guilty conscious. She is awkward in front of the camera. When he posed her it was usually at a distance, or in the car.). <br /> A favorite activity of many DLP staff is to go Googling to see what’s new in blog traffic about the Cushman site. A few commenters on the site mention search and browse functionality, (one says, &quot;the collection, however you want to slice and view it (and the website provides ample ways…), is nothing short of amazing&quot;) but for the most part, they’re silent on the issue. This is a good thing. So often you hear commentary on image collections and the focus is on flaws in the delivery. We’re hearing none of that. Usability is something you only hear about when it’s bad. We’re confident our enhanced functionality facilitates the free exploration the images of the collection by our users. <br />
  • It has been great working with a project team that believed in the value of the innovations in searching and browsing we were proposing for the Cushman collection. The Cushman project has allowed us to take what we learned from existing image collections not one but several steps further. We hope that the success of the collection will allow it to be seen as a model for other online image collections. <br /> However, there are reasons why this sort of tight integration with the controlled vocabulary is mostly written about rather than realized—there are many technical, thesaurus management, and user interface issues that require attention to truly make an integrated, transparent, and dynamic browse and search happen. We’ve come up with some solutions to these issues for the Cushman collection. We’re now starting the process of trying to figure out how to repeat the functional gains we’ve made in future collections. We can’t afford to spend the money and time we spent developing Cushman for every image collection. We need ways to re-use parts of the system we’ve developed in new applications. <br /> We also intend to apply what we have learned and will learn to a future IU image repository that some on our team lovingly call the “Big Image Thing.” With the Cushman collection, we have taken our first major steps towards defining the functionality needed for this centralized repository. We have more user studies to do on our way towards this goal, but we know that dealing intelligently with TGM and other controlled vocabularies, including TGN, will definitely be a large part of that functionality. <br />

Cushman Exposed! Exploiting Controlled Vocabularies to Enhance Browsing and Searching of an Online Photograph Collection Cushman Exposed! Exploiting Controlled Vocabularies to Enhance Browsing and Searching of an Online Photograph Collection Presentation Transcript

  • Cushman Exposed! Exploiting Controlled Vocabularies to Enhance Browsing and Searching of an Online Photograph Collection Michelle Dalmau, mdalmau@indiana.edu Jenn Riley, jenlrile@indiana.edu IU Digital Library Program Brown Bag Series
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Overview  Introduction  Metadata  Research Overview  Usability Findings  Browse and Search Specifications  Implementation  Lessons Learned
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 The Cushman Collection  Funded with an Institute of Museum & Library Services (IMLS) grant  ~14,500 color slides taken between 1938- 1969  Held at the IU University Archives  Site launched October 2003 and March 2004
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Looking Back  U.S. Steel Gary Works Photograph Collection  ~2,200 Images  Archival descriptions  Assigned subject terms from CV  Subject field search requires referencing the A-Z list of subjects  Usability studies revealed not using the CV’s syndetic structure impacts searching
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Metadata for Image Collections  Advantages to “free-text” descriptions:  Preserve photographer’s notations  Resembles the user’s language  Advantages to CV descriptors:  More access points  Collocation  Disambiguation  Interoperability
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Metadata for the Cushman Collection  Cushman’s description in notebooks and slide mounts  Dates  Location  Names  TGM I – LC Thesaurus for Graphic Materials: Subject Terms  TGM II - LC Thesaurus for Graphic Materials: Genre & Physical Characteristics  TGN – Getty Thesaurus of Geographic Names
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 TGN: Getty Thesaurus of Geographic Names  Online browser available  Data available for licensing for incorporating into a local system  Current and historical place names  Hierarchically organized  Useful as research tool and as structured CV  Cushman cataloging  Cushman display
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 TGM II: Genre and Physical Characteristics Terms  Online and free downloadable versions available  Contains over 600 terms  Poly-hierarchically organized  We only used 24 TGM II terms  Multiple genres assigned when appropriate  More appropriate than AAT for our generalist users  Cushman cataloging  Cushman display
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 TGM I: Subject Terms  Online and free downloadable versions available  Contains over 6,300 terms  Hierarchically organized  Includes terms for what picture is OF (eg dogs) plus what picture is ABOUT (eg democracy)  Cushman cataloging  Cushman display
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 TGM I: Subject Terms Strengths and Weaknesses  Strengths include:  Pre-defined relationships between concepts  Some lead-in vocabulary  Weaknesses include:  Complete syndetic relationships lacking, especially for new terms  Language not user-friendly  Not enough lead-in vocabulary  Form and number of top-level categories not useful for a browse structure
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Searching Image Collections: Research Shows  Complement free-text with controlled vocabulary searching (Fidel, 1991)  Image retrieval is heavily based on textual labels (Choi & Rassmussen, 2003)  Query expansion methods based on the CV relationship structures can increase access (Greenberg, 2001/2002)  Automatic Expansion: Synonyms and Narrower terms are good candidates for automatic retrieval  Interactive Expansion: Broader, Narrower and Related terms are good candidates for user-directed, “manual” retrieval  Search assistants are helpful (Harping, Getty, 1999)  Integration of Getty vocabularies (“a.k.a” and ARThur)
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Browsing Image Collections: Research Shows  Browsing is exploratory – it fosters new connections, innovative use of resources and the ability to easily pursue new paths (Bawden, 1993)  Browsing is a significant part of image discovery (Choi & Rasmussen, 2002)  Guided, flexible browsing in context works ( Flamenco and SI Art Image Browser projects)
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Usability Methods  Group Walkthrough (prototype excerpt)  Paper-based tasks and prototype evaluation  4 participants (mostly librarians)  Individual Walkthrough  Interview and prototype evaluation  2 participants (faculty)  Task Scenarios (prototype excerpt 1 & 2)  On-site task-based testing (14 tasks)  12 participants (staff, students and faculty image users)
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Usability Findings Show  Searching  Referencing an A-Z list with no lead-in terms for searching is NOT helpful at all  Concerns about word choice (US, USA or America?)  Iterative reformulation of queries in context is desired  Relevant suggestions are helpful
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Usability Findings Show  Browsing  Structure is important  Contents should be easily exposed  Flexible and combinatorial browsing is desired  Browsing cultivates searching
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Implementation Specifications  Search  Mapping from lead-in vocabulary  Retrieval of all records with narrower terms  Integrated search against BOTH “free-text” descriptions and thesaurus  User-initiated broadening and narrowing  Browse  Year  Genre  Subjects (hierarchical)  Access via assigned headings with ability to move up and down (pending user studies)  Location (hierarchical)  Combination of facets
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Implementation of the Cushman Web site  Java using Java Servlet and Java Server Pages (JSP)  HTML / CSS for interface display  Oracle 9i, Release 2 database  Oracle Text  Tomcat and Apache HTTP servers  JPEG images served from file system (PURLS)
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Thesaurus-Enhanced Browsing & Searching: Oracle Text  Link to existing thesaurus or define custom thesaurus  Preferred terms  Broader terms  Narrower terms  Related terms  SQL syntax for using thesaurus to expand database query  PL/SQL stored procedures for getting information from thesaurus itself
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Challenges Using Oracle Text  Preferred term matches multiple lead-in terms  Crops USE Farming; USE Plants  Phrase matching  Military finds Military officers, Military uniforms, etc.  Qualifiers  Cranes vs. Cranes (Birds)  Punctuation used in TGM terms
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Lessons Learned Approach to metadata needs to be well-planned and flexible Metadata quality control is essential Need more data on how people use images This stuff is HARD!
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 But It’s Worth the Effort!  Enhanced discovery  Innovative implementation for a production-level collection  People love the Cushman Collection!
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Looking Forward  Strive to make our collections truly accessible even if only incrementally  Sustainability of the Cushman approach  Defining functionality for future image repository for all of our collections
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 References  Bawden, D. (1993). Browsing; theory and practice. Perspective in information management, 3 (1): 71-85.  Choi, Youngok and Rasmussen, Edie M. (2002). Users’ relevance criteria in image retrieval in American history. Information Processing and Management, 38: 695-726.  Choi, Youngok and Rasmussen, Edie M. (2003). Searching for Images: The Analysis of Users’ Queries for Image Retrieval in American History. Journal of the American Society for Information Science and Technology, 54 (6): 498-511.  Fidel, Raya. (1991). Searcher’s selection of search keys: Controlled vocabulary or free-text searching. Journal of the American Society for Information Science, 42 (7): 501-514.
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 References (con’t)  Greenberg, J. (2001). Optimal QE Processing Methods with Semantically Encoded Structured Thesauri Terminology. Journal of the American Society for Information Science and Technology, 52 (6): 487-498.  Harpring, Patricia. (1999). How forcible are right words!: Overview of applications and interfaces incorporating the Getty vocabularies. Archives & Museum Informatics: http://archimuse.com/mw99/papers/harpring/harpring.html  Hearst, Marti et al. (2002). Finding the flow in web site search. Communications, 45: 42-53.  University of California, Berkeley: Flamenco Project -- http:// bailando.sims.berkeley.edu/flamenco.html  University of Michigan: SI Art Image Browser --http:// www.si.umich.edu/Art_History/
  • Indiana University Digital Library Program Brown Bag, May 7, 2004 Shout Out!  Thanks to the Cushman Team comprised of Archives and DLP members especially . . .  Randall Floyd (Database Guru)  David Jiao (Java Genius)