Just keep clicking Till You Find It: Building a Library Digital Collection Interface with Browsing in Mind

2,794 views

Developing navigation tools for browsing in a Digital Collections interface (clouds, facets, links to other digital resources) using existing metadata. We will examine how the tools were developed, how they work, and how users have reacted to them.

This session will explore how East Carolina University's Joyner Library developed an interface to their digitized special collections to facilitate user browsing. The library's digital collections contain thousands of items digitized from hundreds of collections – in some cases only one or two items are digitized from a collection. This hodge-podge approach is a result of the library's image management practices which attempt to store materials digitized on a daily basis (for patron requests, preservation concerns, publication or exhibits, etc.) into the publicly available digital repository.

As the repository was being developed, the staff of Joyner Library decided that the traditional approach to presenting digitized special collections materials as a sort of online "exhibit" where materials are selected to illustrate a theme or to systematically convert an entire collection to the digital format would not work. Instead, the staff experimented with different ways to enhance user browsing through materials. They looked to the world of commercial websites, next generation catalog interfaces, and social networking sites to develop a suite of navigation tools that enhance serendipitous discovery using their own home-grown solutions that are built on top of an SQL database and an XML database. The final collection interface includes: broad thematic "collections", "tag cloud"-style navigation, and a faceted-browsing refinement tool, all developed from cataloguer-created subject headings; hyperlinked terms in item records to facilitate broadening searches; links back and forth between collection finding aids and other digital resources at the library; user commenting and tagging of resources to begin to integrate emerging folksonomies.

Published in: Education, Technology

  • Hello, my name is Gretchen Gueguen. I'm the head of Digital Collections at East Carolina University. I'm going to talk with you today about creating a new digital library at ECU. I'm going to focus particularly on how we tried to implement features that suited our users' research styles, and why we believed browsing to be foremost among them. Today's talk is going to center on tools for accessing archival and special collections materials. I know that a lot of the profession is looking at large-scale book scanning or faculty-created research, but our special collections are a rich archive of content unique to the institution. And being a regional university, about an hour from a major airport, having this content online has a much higher impact for us than any other group of materials.
  • So here is an overview of what we're going to talk about today. I'm going to start by examining why we needed a repository. We'll discuss the "behind the scenes" considerations taken in designing not just the repository, but the whole digital library program. Next I'll look at the research we did on our users and how they specifically use this material. Then, we'll examine the creation of the digital library itself: repository structure, metadata design, and the interface and browsing tools we created. Finally, I'll end with some evaluation of the product and how it is actually being used.
  • So we start by taking a look at what was going on at ECU, and why we needed to create a tool like this. ECU, as I said, is a regional university, but it is pretty large: about 27,000 students -- mostly undergraduates. Joyner Library, where I work, serves the academic needs of most of the campus (music being the exception, with its own facility).
  • The Digital Collections department has existed in some form or another since the late 90s. The current incarnation is part of the Special Collections department and has 4 full-time staff members involved in programming, metadata, digitization, and web development. We also employ 5 to 10 student employees each semester: a mixture of undergraduates and graduates who do the majority of the digitization and some of the more routine metadata creation. This is a picture of the fake band we pretend we're in… it's always dangerous to have Photoshop experts on your staff, you know…
  • Joyner began work with digital initiatives, like many universities, in the late 90s, doing "exhibits" – small, static HTML pages that mimicked the kind of experience you'd get in a real-world exhibit – in fact, they were begun as extensions of actual in-house library exhibits. While the exhibit has many virtues, it is much better for showcasing material than for supporting in-depth research.
  • The exhibit model is oftentimes followed by a larger model, the collection, in which search can be carried out, but items are still all related to a single theme. This slide is an image of a large collection created by ECU around 2003 which we still maintain. It holds the full text of some 400 books and images and video of just over 100 artifacts from local museums. All of the objects are related to the history of Eastern North Carolina. In general, collections may contain a lot of material, and may in fact be no different from what we call a digital library, but we distinguish a "collection" from a library by the thematic focus, which shows some control over selection and content.
  • In addition to the creation and ongoing maintenance of exhibits and collections, Digital Collections also works on a few, let's say, "side projects." The first is maintaining the archive of XML finding aids for the special collections department. Finding aids describe entire collections, in most cases, in the aggregate: series, boxes, and folders rather than item-level descriptions, as is the norm in catalogs and most digital collections or libraries. Despite being the main artery to the same special collections materials that populated the exhibits and collections, prior to 2009 these finding aids were completely separate from our other digital initiatives. In 2006, our work also began to include ad hoc requests -- on-demand scanning for patrons, scanning for library exhibits or marketing materials, etc. Whatever happens to pop up. As an economy of scale, it just didn't seem feasible to do this scanning as a one-time-only thing, so we wanted to create a system that would enable the discovery of that material as well.
  • By 2007 the act of creating multiple exhibits and collections was becoming a burden. In short, we had too many silos of information for our hypothetical user to have to search through. As you can see here, the collections and exhibits were in one place, the ad hoc items not accessible at all, and the finding aids somewhere else. This was not good for them, and it also wasn't good for us to have to maintain all of this separate infrastructure. Instead, we needed a common system to host collections, exhibits, and other digital objects as well as finding aids. We further needed something that would allow users to search across all these sources, but would also keep the context of exhibits, collections, and finding aids intact for browsing and other discovery methods.
  • That is a lot of purposes to serve with one tool, though (essentially, it's an image management and user discovery tool for materials with differing levels of description and context). To be able to handle these needs we knew that the system would need three specific qualities. First, it would have to be UNIFIED, with some interoperable and sophisticated metadata standards (or actually a combination of them) and robust architecture that could handle the multiplicity of content and usages. The system also needed to be FLEXIBLE. It had to have the ability to handle complex objects of all different types: text, image, audio, video, or any combination of them. Each object also had to have multiple relationships: multiple collections, translations, editions, etc. Finally, the system needed to have MODULARITY, although I don't mean this in the strict programming sense of the word. Instead, we needed the repository to support different types of tasks with appropriate tools. For example, in addition to a killer user interface, the repository needed to have really good administrative tools available so we could have staff at all levels of the organization use the system to do things like request digitization services, create technical and descriptive metadata, track digitization projects, etc.
  • With that basic sense of what the backend needed to do for us, we moved on to figuring out how the repository would suit the needs of our users. The first question is: "Who are the users of our materials?" It actually is a big (and ongoing) question. Naturally, on the web things get seen and used by people I never see myself. One way that we could guess who these materials might be most useful for is to look at who uses the material in its analog format: who is using the archival material we digitize, as opposed to scientific data, general collection books, or faculty research.
  • And it turns out we weren't the first people to ask that question. This is a table (from a recent OCLC Research Report) that synthesizes a set of RLG reports on source materials by discipline (I'll have the citations to this or any other source I mention at the end of the presentation, by the way). Using this chart, we know that the materials we are talking about are in these columns: audio/visual material and archival material. In addition, the disciplines using those materials are these ones. So, with a few outliers in the sciences, we were dealing with humanities researchers (like Dr. Henry Jones, Sr. here, professor of medieval literature). The next question then is: how do humanities researchers do their work? And again, we were far from the first people to ask that question.
  • That same report summarized a lot of the existing research on faculty. They found that: "… humanities scholars and other researchers…rely heavily on browsing, collecting, rereading and notetaking. They tend to compile a wide variety of sources assembling, organizing, reading, analyzing and writing." The research found, naturally, that these scholars expect to be able to use a large and diverse set of primary resources, with full searchable text available where appropriate. Yet, due to the amorphous nature of their research topics, they often do not do precise searching and prefer to browse through a lot of materials. In addition, those scholars wanted to create their own context and their own notes, and to create their own personal collections of what was relevant to them and their particular research question… the question is whether there are digital tools that do this. Faculty also want better tools to be able to incorporate digital primary sources into their classes. So this characterizes a research style that is initially broad and especially engaged with primary resources, but which interacts with that material in a very deep way. The next step would be to look at resources they actually use…
  • So let's start with the traditional way to find materials, the catalog. As we are all aware, catalogs are increasingly offering tools that help those who do broad searches browse through or refine results. As we see here, the faceted refinement in this Endeca catalog supports an iterative search process (broad search, evaluate, refine). Some of these next-gen catalogs are also providing access to materials across different sources, providing that wide variety of sources. As we discussed earlier, though, the catalog is more often than not describing item-level material, therefore losing the ability to do similarly sophisticated searches of archival materials that are often described at the group level.
  • On the other hand, the online finding aid, the primary tool for accessing materials in archives, almost exclusively uses that aggregate level description So this is a piece of a finding aid from our own collections at ECU, showing the container list (where the actual items are described). This one in particular is a small collection and the description is only at the box level…  And here is part of the container list of a really big one that is described down to the item level in some cases. Most finding aids on the web at the moment don’t really interact much with digitized content. In this case you can see a link out which will pop up the scanned image listed here…
  • But there have been some really great projects that integrate digitized materials with finding aids, like this one from the Archives of American Art. Once you click on one of those series for example…  and navigate down to a folder, you can interact with the materials at that level, in the aggregate. So it’s a lot more like actually going through the collection, folder by folder…It mimics traditional research and more importantly it provides context (because you can see the whole collection together) but also the ability to find things without discrete searching…it allows browsing through materials. But at the same time, in the archival setting a scholar could do some, at least minimal, marking (with flags) or sorting (you’re not supposed to reorganize the collections, of course). They can also acquire copies and create personal collections that have the context they are creating. In short, without some way to do that kind of work, this tool is somewhat limited.
  • Tools that do more than just find and look at resources are increasingly being created by scholars themselves. This is a tool created for a project I previously worked on called The MacGreevy Archive. In this particular tool, the scholar behind the project could place her transcription, with her own additional annotations (you see one there in the blue box), side by side with scans of the original text, as well as biographies of people mentioned in the letters and basic bibliographic info (in tabs that are not shown here). While in this case a scholarly audience couldn't necessarily add more annotations or download and use this tool on their own texts, the point is that the tool is used by the scholar to actually do her work, rather than just to find or view resources. There are many other projects that do offer tools that can be downloaded and used as part of a growing "Digital Humanities" movement, as well.
  • So that was a pretty good summary of one group of users, but on the other hand, ECU is an academic institution that, by and large, is concerned with education. The missions of the university, the library, and digital collections in particular all stress education as a primary focus. And for us, that population to be educated is overwhelmingly undergraduate students. There may be scholars from across the world using our materials, but we would need this tool to suit the majority of our user base: students, particularly students in humanities courses (although, full disclosure, that is Max Fisher, a high school student, doing the hardest geometry problem in the world… I just like the picture).
  • So, undergraduates may be one of the most studied creatures in the academic animal kingdom. There are many studies about their research habits, their skills, their motivations and behaviors. Some basic thoughts on this group are that they have simple search strategies because they are often looking for generalized or broadly-based materials. In my own usability studies, I've found that when they have difficulty forming search strategies, undergrads are happy for guidance like lists of topics to browse. One student told me in a test that her search strategy was to "Just keep clicking until I find it" – which gave birth to the title of this talk. As part of a "digital native" population, they spend a lot of time using new media, although they may not be savvy about the creation of it. They are often unsure of what materials are appropriate for their coursework and need some guidance. Expectations for this group include an idea not exactly that everything is online, but that you can do well enough with what is online. They also approach any search tool with the implicit idea that it works like Google or other search engines (keywords connected with boolean operators and results in relevance order). Unlike our scholars, undergraduates are unfamiliar with finding aids (which is unsurprising, since their use of them is low).
  • So again, the next step was to look at actual sources students use. I'll start with Wikipedia, used by a lot of students to start research. So what does Wikipedia have that makes it easy to use? Well, I'm actually not going to talk about its "wiki"-ness. Instead, what I think makes it easy to use is the prevalence of hyperlinks, so the student is guided through the topic with easy ways to find additional, relevant information. The context is there in the links. One report I read referred to this activity as "bouncing."
  • The next site I wanted to look at was Amazon. Like Google Books, Amazon has been touted as a sort of alternative catalog for many students because it is perceived as easier to search. But just for fun, let's start with a catalog, our catalog at ECU. So, I did a search for "john donne apostasy" (if you don't know, the English poet John Donne was a famous apostate from the Catholic church) the same way I would search any other resource. The catalog actually finds nothing… this is actually a list of subject headings it suggests I browse instead. Next, I'll try Amazon… and I do get results, and ones that are relevant. In fact, this first one is also a part of Joyner's collection, but I couldn't find it through Joyner's catalog because the text wasn't searched and the query I entered was searched as a phrase. In addition to that, I also find lots of ways to locate other relevant material… lots of options to browse other stuff you might want to buy and things it recommends for you. If I can't remember the precise name of something, the links here, as well as this organization of Departments and other kinds of groupings, can really help me find it. The interface is essentially set up to "guide" you and ensure that you always have a clue to find something else. This is, obviously, the goal of a commerce site, but it could also be pretty helpful to that undergrad who wants to keep clicking.
  • So we have two different user groups with different needs: one often doing deep and narrow research, the other doing shallow and broad. Our dilemma was to try to find ways to present information that could offer features to both groups. We broke the different features we saw into three broad categories. The first was Organization and Guidance, with our researchers more interested in the former and our students more in the latter. But organization and guidance tools both allow materials to be grouped at different levels for different purposes: things like broad categorization, faceted manipulation of results, and the option of viewing things within their archival "finding aid" context or not. These tools would allow the searcher to go very deep without requiring them to, and could also be used to help suggest resources to one who might need a guide.
  • The second area in which we identified features was what I'll call "Data-driven" discovery. These are tools that take metadata and expose it in various ways to identify underlying relationships. These are related to the first tools in their exploratory nature, but they do not require the creation of new hierarchies or organizations. For example, a feature like a "tag cloud" could be useful for many different things: it can provide a dynamic way to browse the collections, and it is also sometimes used to gauge the scope of a collection before working with it. Tag clouds can highlight relationships between materials that might previously have been lost. On the other hand, the ability to automatically retrieve similarly indexed items through hyperlinked terms in records provides the opposite power: to explore the limits of the collection by tunneling back out to other materials once you've found something relevant. Finally, I include full text here to really indicate our increasing ability to search across more and more. As we are, I'm sure, all aware, searching across more data has become increasingly important…
  • The last area in which we identified a need for features was "personalization." Although these are the types of features that are most touted as "2.0" and "youth-oriented," we actually felt that offering personalization and customization would benefit those researchers who needed to deeply engage with the content. For example, with the use of tags, a researcher can create a personalized categorization. In addition, tools that actually help researchers create personal sets that they can work with are even better (the text there says "drag items into this tray to add them to YOUR collection"). Finally, tools that allow users to actually use, analyze, and otherwise manipulate data give our users the chance to not just passively consume the content but actually do the work that they want to do in the research space. This is an example of an interactive mapping application that overlays maps from different historical periods, but here I really mean any tool, like the digital humanities tools we looked at earlier, where the text is used in some way.
  • So, with all of this research in hand, we set out to build the perfect mousetrap.
  • Building a repository from the ground up to make navigation as flexible as possible meant we had to start thinking about those interface features even at the most basic level: from what system to build it on and what metadata schema to use.
  • Our first decision, really, was what system to build it on. We did an initial review of digital asset management and repository software and realized that, no matter what, we wanted to build it ourselves to create the kind of interactive end-user and robust backend tools we felt we needed. But we also had to compare the system we had been using to run the large collection (called the Eastern North Carolina Digital Library, which I mentioned at the beginning of the presentation), as well as the finding aids database, to a new crop of open source tools like Solr and Lucene that were behind a lot of digital library development. The digital library project was begun in 2003, before a lot of the open source tools were widely used. It was built on a product called TEXTML, an XML server software, and the web application was built using the ASP.NET framework and C#. On the one hand, TEXTML was understood by us, well supported, and currently being used on two projects. On the other hand, if we started over with something, especially something open source, there was the possibility the product could be more flexible, but it would mean starting over, and particularly for our programmer, probably an initial learning curve. It would also mean either running two systems or migrating our other systems to the new one as well. In the end we chose to continue to use TEXTML and ASP.NET since it provided the same basic functionality but did not represent a learning curve for us.
  • So this is a really simplified diagram of the repository. We have a web application layer for the online access; it interacts with TEXTML and its index of the METS documents, which contain links to the image/audio/video files. So this is really pretty standard; a lot of repository architecture looks like this. But I wanted to start with this, since we are going to be adding complexity later.
  • The main portion of the database is that archive of METS XML records. Each "object" in the repository, which might consist of various files, has a METS record providing the underlying structure. We record several types of descriptive metadata in the METS dmdSec to provide that flexibility to handle multiple types of objects. MODS is used for the main descriptive metadata, and an additional Dublin Core section is scripted from that specifically for OAI harvesting. We use TEI for any full-text transcription whenever it exists, and record tags and comments in a locally defined section. There is an amdSec for the technical metadata for each associated file: MIX, AudioMD or VideoMD. At the moment we are not storing any preservation metadata other than what is found in the technical metadata. In the METS fileSec we keep track of a persistent URL for the access copies of the images. Unfortunately, at the moment, we've only created a placeholder for the masters, as we are currently working on a preservation strategy to include long-term storage. At the moment, however, this record does record the filename with which we can track the preservation master. Our finding aids are all described in EAD files, which are not represented in METS files; they exist in a separate XML index and repository. I'll discuss EAD more later…
  • So, to just take a closer look at that METS: In DMD0001, we have MDTYPE MODS, the main descriptive MODS record. In DMD0002, we have a MDTYPE Dublin Core record. In DMD0003, MDTYPE "user," we record user comments and tags. And in DMD0004, MDTYPE TEI, we have the TEI (although just the body, because the head would be redundant with the MODS) if one is present. For some objects with multiple files we needed to have separate captions or descriptions for each file in a group. This required a separate dmdSec with a partial MODS record for each image relating back to the files in the file group. So in this example, we have another dmdSec with MDTYPE MODS, but it refers back to GROUPID "3," or the third image in the object (and it's a group because it's both a master and access copies). This also allows us to retain one digital object identity, but with flexible descriptive metadata specific to particular images or files.
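  As a rough sketch of that structure, here is how such a METS skeleton might be assembled with Python's standard-library ElementTree. The dmdSec IDs and MDTYPE values follow the description above, but everything else (the file IDs, the empty xmlData wrappers) is illustrative, not the actual production records:

```python
import xml.etree.ElementTree as ET

# One dmdSec per metadata type, as described in the talk:
# MODS (main description), DC (for OAI), "user" (tags/comments), TEI (full text).
mets = ET.Element("mets")
for dmd_id, md_type in [("DMD0001", "MODS"), ("DMD0002", "DC"),
                        ("DMD0003", "user"), ("DMD0004", "TEI")]:
    dmd = ET.SubElement(mets, "dmdSec", {"ID": dmd_id})
    wrap = ET.SubElement(dmd, "mdWrap", {"MDTYPE": md_type})
    ET.SubElement(wrap, "xmlData")  # real records carry the metadata here

# fileSec tracks persistent URLs for access copies; GROUPID ties a
# partial per-image MODS dmdSec back to its file group.
file_sec = ET.SubElement(mets, "fileSec")
grp = ET.SubElement(file_sec, "fileGrp")
ET.SubElement(grp, "file", {"ID": "FILE0001", "GROUPID": "3"})

md_types = [d.find("mdWrap").get("MDTYPE") for d in mets.findall("dmdSec")]
print(md_types)  # → ['MODS', 'DC', 'user', 'TEI']
```

In a real METS document the locally defined "user" type would typically be expressed via MDTYPE="OTHER" with an OTHERMDTYPE attribute; the value here simply mirrors the talk's shorthand.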
  • With the details of how individual items are developed in place, the next issue was overall organization. Although we were developing one digital library, we were going to have to deal with some existing subcollections, and we'd probably need to develop even more as a basic feature needed for organization and browsing as well as for our own internal management. For example, the East Carolina Manuscript Collection has over 1,700 "subcollections," some very small and others very large, many with only one or two items digitized, if any. Presenting them as simply 1,700 individual collections didn't seem the best approach. We want to retain those analog collection and departmental relationships, but we don't need to limit ourselves to them. In fact, the user studies we did showed us that a way to browse that was more general and related to content would be preferred.
  • So we created about 20 collections based on the themes that were present not just in the already digitized content, but those that are important to our collections. Objects were assigned to these collections based on a mapping of subject headings, and all subsequent items are added to these collections as appropriate. These broad thematic collections coexist with a smaller number of very specific collections created for special reasons. So the African-American history collection and the Agriculture collection are side by side with the very specific collection related to a local figure, Alice Person, created as part of faculty research. In the basic implementation, these collections have a very simple functionality: a little introductory "jump" screen and then the collection of items.
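  The subject-heading-to-collection mapping described above might be sketched like this; the heading strings and collection names are invented for illustration, not ECU's actual mapping table:

```python
# Hypothetical mapping from cataloguer-supplied subject headings to
# broad thematic collections (one heading can feed several collections).
SUBJECT_TO_COLLECTIONS = {
    "African Americans": ["African-American History"],
    "Agriculture": ["Agriculture"],
    "Tobacco farms": ["Agriculture"],
}

def assign_collections(subject_headings):
    """Return the thematic collections an object belongs to, based on
    its subject headings; unmapped headings are simply skipped."""
    collections = set()
    for heading in subject_headings:
        collections.update(SUBJECT_TO_COLLECTIONS.get(heading, []))
    return sorted(collections)

print(assign_collections(["Tobacco farms", "African Americans", "Quilts"]))
# → ['African-American History', 'Agriculture']
```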
  • If we want to add a little bit more context, we can create some context screens. In this one, there are a number of pages like this, that add some information about the collection and help guide the user through it.
  • On the other hand, when necessary, we have the option of really building a customized site. This lives within the web application layer and interacts with the same index and XML/document core, and we can reuse the web application code for basic behaviors.
  • The next basic feature to work out was the searching. The searching options with TEXTML make any query a phrase search. We adapted this behavior to place an "AND" boolean operator between terms to conform more with Google-style searches. The index was created from key fields in the MODS record, as well as every word of the TEI if present, and eventually the tags and comments. A simple inverse document frequency algorithm over all of those searched fields was used to rank the results by relevancy. Once the XML is returned by TEXTML, the ASP.NET application handles the layout and display. This is the first iteration of search results, which you see doesn't have that faceted drill-down yet. A hit for a record pulls a thumbnail from the directory, or a default icon when the item is not an image. These thumbnails are one of two derivatives of the file that we created for the basic implementation.
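  A minimal sketch of that search behavior, with an implicit AND between terms and a simple inverse-document-frequency weighting; the toy documents and the exact scoring formula are assumptions for illustration, not the production ASP.NET/TEXTML code:

```python
import math

# Toy index: doc id → searchable text (standing in for MODS key fields
# plus TEI full text).
DOCS = {
    1: "letter from a tobacco farm employee",
    2: "photograph of a service station",
    3: "tobacco farm ledger and employee records",
}

def idf(term):
    """Inverse document frequency: rarer terms carry more weight."""
    df = sum(term in text.split() for text in DOCS.values())
    return math.log(len(DOCS) / df) if df else 0.0

def search(query):
    """Split the query on whitespace, require ALL terms (implicit AND),
    then rank matches by summed tf * idf of the query terms."""
    terms = query.lower().split()
    hits = [(doc_id, sum(text.split().count(t) * idf(t) for t in terms))
            for doc_id, text in DOCS.items()
            if all(t in text.split() for t in terms)]
    return [doc_id for doc_id, _ in sorted(hits, key=lambda h: -h[1])]

results = search("tobacco employee")
print(results)  # docs 1 and 3 contain both terms; doc 2 is excluded
```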
  • The other is the one that is retrieved in the item record. The image on this page is a standard size of 800 pixels on the longest edge. Although we knew that our users would want images at a larger size, we hoped that this would be large enough for most uses but small enough that we wouldn't eat up all available server space in six months. However, we know that our primary audience really needs to study these resources in depth, so for images of text or maps that would normally be illegible at this size, we do offer a large option. This fall we will begin working on a solution involving JPEG2000 to enable zooming, and we hope that will solve some of our problems.
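  The 800-pixel derivative rule can be expressed as a small sizing helper. This illustrates only the scaling arithmetic (preserve aspect ratio, cap the longest edge, never upscale); the function name and the no-upscale choice are assumptions, not ECU's actual image-processing code:

```python
def derivative_size(width, height, longest_edge=800):
    """Scale (width, height) so the longest edge is at most
    `longest_edge` pixels, preserving aspect ratio; images already
    smaller than the cap are left alone rather than upscaled."""
    longest = max(width, height)
    if longest <= longest_edge:
        return width, height
    scale = longest_edge / longest
    return round(width * scale), round(height * scale)

print(derivative_size(3200, 2400))  # → (800, 600)
print(derivative_size(400, 300))   # → (400, 300)
```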
  • That was just the bare bones of the repository, though. With those in place, we get to move on to the fun stuff, the dynamic features that facilitate a lot more of the browsing…
  • The most interesting dynamic feature, I think, is the "tag cloud," or really subject cloud, that we created. Some of the usability work that we read found them preferred for tasks where information-seeking was general (as we saw especially in our student users), so we thought it would be a neat way of visualizing what the collection was "about." Of course, we didn't have any tags yet, but we did have cataloguer-supplied subject headings… Those, of course, usually come in long strings, but we thought, hey, the subfields in these strings are kind of like tags themselves in that they are single concepts... what if we broke them up and then made the tag cloud out of them? So that's what we did. The MODS records actually record the subfields in separate elements, so they were easy to break apart. We hoped that by breaking these subfields up, we would find more linkages between concepts than full subject strings alone would provide. For example, if I clicked on a subfield that said "employees"…
• …this brings together “employees -- service stations” with “employees -- telephone companies,” “employees -- ” anything else. These results might not have been seen together had the entire string been used in the cloud or any other visualization. On the other hand, there are a whole lot of subfields: 3,208 unique ones. If we look at the repository as a whole (this data is from August), we see the following distribution.
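The subfield-splitting step above can be sketched simply. In our case the MODS records already held subfields as separate elements, so this split on the “--” delimiter of a full LCSH-style string is an equivalent illustration, not our actual code.

```python
from collections import Counter

def build_tag_counts(subject_strings):
    """Split full subject strings on '--' into individual subfields
    and count occurrences: the raw material for a subject cloud."""
    counts = Counter()
    for subject in subject_strings:
        for subfield in subject.split("--"):
            counts[subfield.strip()] += 1
    return counts
```

Counting at the subfield level is what lets “Employees--Service stations” and “Employees--Telephone companies” meet under a single “Employees” tag.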
  • I should preface this section with the caveat that I’m not a mathematician, however… This is a logarithmic distribution of all the subfields used. There are 3,208 of them so the logarithmic breakdown makes it a bit easier to see what’s going on.
• So, the most used term, this top dot, is used just over 20,000 times, and the second most used term is used just over 10,000 times. That starts to sound like standard adherence to Zipf’s law, a statistical regularity often found in text corpora, which predicts that the second most used term is used ½ as often as the first, the third ⅓, the fourth ¼, etc. However, the distribution begins to deviate after that.
• Zipf’s law would predict a slope of -1, seen in green here, but the actual slope is -.81. That isn’t too bad, but what is probably more significant is this gap between the 5th most used word and the 6th: a gap of nearly 5,500 usages. One reason this gap might exist is the nature of the material that has actually been digitized. Although the repository was set up for multiple purposes, just under 75% of images come from the same subcollection of local history images, whose digitization was funded by a grant. What is obvious, though, is that our most used terms are used so often that they significantly dwarf other terms, beyond what would reasonably be expected.
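The slope figures above come from fitting a line to the log-log rank/frequency plot. As a hedged sketch (I did this in a spreadsheet rather than code), a least-squares fit looks like this:

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank).
    A corpus following Zipf's law gives a slope near -1; the
    repository's subject subfields came out around -0.81."""
    points = [(math.log(rank), math.log(freq))
              for rank, freq in enumerate(sorted(frequencies, reverse=True), start=1)]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den
```

Feeding in a perfectly Zipfian frequency list (each frequency proportional to 1/rank) returns exactly -1, which is the green reference line on the chart.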
• So if we used a “most used” tag cloud, which we see here, the top 20 or so terms probably won’t change for years, if ever. Maybe that’s okay; it is what it is, and it’s showing what the collection actually contains. But it’s not giving our users the chance to explore the rest of the repository or facilitating browsing. To counteract this, we also included a random assortment, taking a random 100 headings and sorting them. Every time you hit the “shuffle” button, it grabs another 100. Granted, this is not the best method for systematic research, but we thought it was an inspiring and novel way to “browse” around and get a sense of some of the content. One issue with the random cloud is that the sizes of the tags are determined relative to each other, not to the entire repository, so in one random sort “Kennedy, John F.” is really big compared with the other terms, but if we go back to that top-term list, overall it’s relatively small. There is potential for confusion here over what is really in the collection, but we continue to think that the ability to explore is more important than any of these issues. We did do some usability tests on these features, which I don’t have time to go over in depth, and the results were a little mixed. While the features were heavily used, and even preferred, users didn’t always understand where the tags were coming from or what they really meant (they thought they were the “best” or the “most clicked on” terms, or that the list represented ALL terms, etc.).
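The “shuffle” behavior, and the relative-sizing quirk it produces, can be sketched like this. The pixel sizes and function names are illustrative, not what we actually shipped; note that each tag is scaled against the biggest count *in the sample*, which is exactly why a modest term can look huge in one shuffle.

```python
import random

def random_cloud(term_counts, size=100, min_px=10, max_px=36):
    """Draw a random sample of headings and size each tag relative
    to the sample, not the whole repository, mirroring the 'shuffle'
    behavior described above."""
    sample = random.sample(list(term_counts.items()), min(size, len(term_counts)))
    biggest = max(count for _, count in sample)
    cloud = []
    for term, count in sorted(sample):
        # scale within the sample only -- the source of the sizing quirk
        px = min_px + round((max_px - min_px) * count / biggest)
        cloud.append((term, px))
    return cloud
```

Scaling against the repository-wide maximum instead would make sizes comparable across shuffles, at the cost of most random tags rendering at nearly the minimum size.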
• And of course there are outright failures of a sort. For instance, bridging subfields like “training of” appear as tags in their own right when they are used in subjects like “Teachers--Training of--North Carolina--Greenville.” Hindsight is 20/20, of course, and I wish we’d anticipated this better and weeded out some stop words. As it is, we will probably do a weeding of terms like this in the future, and we hope that, while flawed, the system overall provides an enhancement over relying solely on search for access to the repository.
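The weeding we have in mind would amount to a small stop list applied to the subfield counts. This is a hypothetical sketch; we have not settled on an actual list, and the entries below are only examples of bridging subdivisions.

```python
# Bridging subdivisions that make poor standalone tags.
# This stop list is illustrative, not the one we will adopt.
STOP_SUBFIELDS = {"training of", "study and teaching"}

def weed(tag_counts):
    """Drop subfields that only make sense as connectors
    inside a full subject string."""
    return {t: c for t, c in tag_counts.items()
            if t.lower() not in STOP_SUBFIELDS}
```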
• Since we had already broken up the subfields, when we moved on to discussing faceted browsing through search results, we decided to use them again. The geographic subfields in the subjects were removed entirely in favor of the data in a separate coverage element, recorded using the LC FAST schema, to drive the place facet. Other “facets,” like collection and format, come from controlled vocabularies, but date presented special problems because of all of the variables (year, month, day) as well as the inclusion of date ranges. After much debate we finally decided to strip out month and day information (when it even existed) and to include ranges as their own values. The final feature of the facets that we added was the ability to re-sort them either alphanumerically or by occurrence by hitting the a-z icon. And once again we have the same benefits and drawbacks: while this definitely brings together many terms that would otherwise have been separated, it creates a ridiculously long vocabulary. Around 50% of the terms appear only once, and 75% of them appear twice or fewer.
• Moving along to look at the actual record: our philosophy was that once you found something useful, it had to be easy to find others like it. So within individual records we create hyperlinks. With the subjects in particular, these actually point to the full subject strings. So while you might have found a picture of an “employee” through that tag cloud link, you can now narrow down to just the subset of public utilities employees by following this hyperlink. We hope that this offsets the broad, guided browsing feature of the tag cloud by providing a powerful way to both expand and narrow searches.
• Two of the personalization features we were able to implement were commenting and tagging of records. I mentioned earlier how we handle searches of this data and its integration into the MODS record. This is an example of a record with both a comment and a tag. You’ll see we implemented reCAPTCHA there to avoid spam. Once these tags and comments are entered…
• …they are actually sent to an SQL database. In fact, to control most of these dynamic elements we are using an SQL database as an intermediary. So this is what the new diagram of the system looks like. The web application layer now has a set of functions handling the dynamic action. The SQL database is where the comments and tags get written. It is also used for entering metadata changes that are handled through some of the administrative tools that we created, which I don’t have time to go over today. The SQL database is where the subfields are stored individually (they are complete strings in the METS/XML document) and then pushed back out to the interface. The SQL database sends updates out to the METS records when a staff member updates a metadata record. Tags and comments are folded into the index nightly. Like the choice of TEXTML in the first place, some of the use of this SQL layer had to do with current practices and making a reasonable choice with the resources we had. But it also provides some flexibility, as it is relatively simple to develop the web application layer to work with it.
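The write path for tags and comments can be sketched as follows. SQLite stands in for the production database, and the table and column names are illustrative; the essential point is that user contributions land in a relational table first and are merged into the METS/XML records by the nightly job.

```python
import sqlite3

# In-memory SQLite as a stand-in for the intermediary SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE annotations (
    item_id TEXT,
    kind    TEXT,  -- 'tag' or 'comment'
    body    TEXT,
    created TEXT DEFAULT CURRENT_TIMESTAMP)""")

def add_annotation(item_id, kind, body):
    """Write a user-supplied tag or comment; the nightly indexer
    reads rows like these and merges them into the item's record."""
    conn.execute(
        "INSERT INTO annotations (item_id, kind, body) VALUES (?, ?, ?)",
        (item_id, kind, body))
    conn.commit()

def pending_for(item_id):
    """Annotations awaiting the nightly merge for one item."""
    return conn.execute(
        "SELECT kind, body FROM annotations WHERE item_id = ?",
        (item_id,)).fetchall()
```

Buffering in SQL also means the XML documents are never written to by anonymous users directly, which keeps the repository itself authoritative between indexing runs.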
  • The final features of the system that I’m going to discuss are those that go beyond the single repository concept and attempt to create a more holistic approach to discovery by making some of our systems compatible with each other. This is our first step in offering an overall research tool.
• So the library has a number of other online research resources, and we thought it might be interesting to try to integrate searching of the repository with searching of these other resources. In particular, two of these resources are also created and maintained in Digital Collections. Of the remaining, it seemed that the catalog was the most important and the easiest to attempt to search…
• Since the finding aids and the ENCDL were both indexed in other TEXTML indexes, we could easily search those indexes and get exact numbers of hits for them. We included these links as a way of expanding searches. We can also provide a link to expand the search out to the catalog, although because of load time we don’t provide the number of hits at this time, or even a guarantee that any hits will be found. But, as you can see here, this feature is probably going to be most useful in a failed search like this one. Although I didn’t find anything in this database, I do see that there was one result each in the Digital Library and the Manuscript Collection. At this point it’s just links, but as this service grows and changes, we hope to take more of a “discovery layer” approach and actually have the ability to interact with the different types of results.
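The cross-resource hit counts amount to fanning the same query out to several independent indexes. A minimal sketch, with hypothetical resource names and search functions standing in for the per-index TEXTML queries:

```python
def cross_collection_counts(query, search_functions):
    """Run one query against several independently indexed resources
    and report a hit count for each, as in the 'expand your search'
    links described above. Each entry maps a resource name to a
    function that returns matching record ids."""
    return {name: len(search(query)) for name, search in search_functions.items()}
```

The catalog would be the odd one out here: too slow to count synchronously, so it gets a bare link rather than an entry in this dictionary.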
  • So this is what the new diagram of the system looks like. The user search is now searching the index of three resources, all of which are run through similar TEXTML databases, but use different XML schemas.
• While we developed some really nice features in Digital Collections to work with the finding aids (to search them simultaneously), we also decided to spend time revamping the finding aids to work with the repository. Our text markup coordinator, Mark, spent a great deal of time redesigning the style sheet and interaction. This is what the finding aids looked like when they were pretty much independent of the repository: the interaction with the repository was very simplistic, just links out to that resource…
  • And this how Mark redesigned them. Since we were recording the specific location of the materials within the archival context (the collection.box.folder.item), we could find the associated items and display them within the finding aid very easily.  So here you can click a tab and immediately see everything digitized from the collection.  Within the container list on the previous tab, there are also links at folder level, rather than just item level, so you can get access to groups of materials. In addition to providing a powerful tool for humanities researchers, as an introduction to finding aids for our undergraduate users, who do have some limited interaction with the system for their introductory English classes, this gives them a more web-friendly way to navigate the system and learn more about the concept of archival research.
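The folder-level linking described above falls out of the location convention almost for free: a folder is just a prefix of the “collection.box.folder.item” string. A sketch with hypothetical data:

```python
def items_for_folder(digitized_items, collection, box, folder):
    """Given items keyed by 'collection.box.folder.item' location
    strings, return everything digitized from a single folder, so a
    finding aid can link at folder level as well as item level."""
    prefix = f"{collection}.{box}.{folder}."
    return sorted(loc for loc in digitized_items if loc.startswith(prefix))
```

Dropping the trailing folder segment from the prefix gives box-level grouping by the same logic.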
  • Okay, so I've gone through a lot of the features that we created, but the real question is, does any of this stuff work? Do we deserve medals?...  Or angry protests?
  • First, looking at the different features identified earlier, we did a pretty good job. We integrated something to address all of the features in Organization and Data-Driven Discovery, but we do need to work on our Personalization features.
• In terms of actual usage, we are averaging about 300 hits per day, and the average time on the site is four minutes. That is good as these things go, but clearly not “research” amounts of time. We’ve averaged 54% of our traffic from search engines so far, and that share is steadily rising; looking just at the current month, the percentage is in the 70s. The earlier traffic, which came solely from direct links in email announcements and from referring sites, skews the overall numbers a bit. I think this points to our need to do better PR for the repository and to work with our library staff and faculty to publicize the availability of the resource. Relying on Google alone is getting us general users who come for reasons other than research, and our statistics show that. They also show, however, that the most used portions of the site seem to involve local folks coming to look at the local history collection. The local newspaper ran a story in the online and print editions of the Sunday paper on March 16th, and we got 931 visits that day.
• And we continue to get a lot of traffic to those local history images from local people adding their own memories to our comments. One of our oldest staff members, who has been a resident for more than 70 years, has been setting aside an hour or two a week to do nothing but go through and add commentary. The collection was cataloged mainly by two people, neither of whom was originally even from the South, so the insight the comments provide, specifically into agriculture and tobacco cultivation, is really valuable. One of our catalogers plans to go back through this collection next month and officially update metadata records with any verifiable information supplied in these comments. Although this was not the use we intended, it has been very nice to see the collection become such a community resource. Although at the moment it is engendering a lot of nostalgia, we think we have at least provided a venue for capturing information that might otherwise never have surfaced. While the official newspaper record will continue to be preserved in our stacks, we are now capturing an unofficial, first-hand narrative of events to accompany it, which we hope will be useful to all audiences, including scholarly ones, in the future.
• In other ways, though, we have been used as we intended. Although these are not hard indicators of use like the web statistics, we do know and hear about uses of the collection and have even been asked to make changes and additions. We’ve been used in an assignment that every freshman English class has to complete using a primary source. We have also been asked to add a collection specifically for the use of the Interior Design department this fall. We also know of two scholars who have extensively used the materials we digitized from our collections related to the artist and poet A.R. Ammons. In fact, the materials were digitized for the use of one scholar and were then useful to another: exactly the kind of usage pattern we hoped to see when we decided to keep all of that “ad hoc” scanning. Finally, our markup coordinator, who worked on the EAD revamp, did a little analysis of which collections were actually being requested in the reading room, compared with those we had digitized some portion of. He found a pretty strong overlap, suggesting that while we perhaps can’t call what we are doing “mass digitization” in the Google sense of the word, our online collection is at least achieving some level of “mass representation,” accurately reflecting our most used and useful items.
• So, to wrap up, I’ll just say that I have noted several areas in which we are going to continue to work and to gather feedback. For example, we are going to work on better access to high-resolution images, and we are beginning to work with other library tools on better cross-discovery. We are also working on increasing outreach: our experience working with a handful of scholars so far leads us to think that if more knew about the repository, more would use it. The biggest experiment, the reuse of the existing subject headings, has gone just as we thought: not perfect, but a pretty economical decision given that it reuses existing data. We may be reviewing the original list of 20 or so general collections; with around 11,000 digital objects, a group of 20 makes each collection too large for browsing. The lesson I’ve learned from this is that I can put in lots of nice tools that I hope will help research, but if 75% of my collection is local history from the early ’60s, 99% of my traffic will be from local community members looking for people and places they remember. However, the tools we created here are ready to be used further, and have begun to be used, and I hope that through content selection and PR we can attract more researchers to the site. At this point, I’d be happy to take any questions…
  • Transcript

    • 1. Just Keep Clicking Till You Find It Building a Library Digital Collection with Browsing in Mind Gretchen Gueguen, Digital Initiatives Librarian J.Y. Joyner Library [email_address]
    • 2.
      • Behind the Magic…
        • Designing the Program
      • Our Adoring Public…
        • Users and their uses
      • A Better Mousetrap…
        • Designing the System
      • Evaluation
        • Analytics, Usage, and Other Indicators
    • 3. Miss Pitt County rehearsal. (1966). The Daily Reflector Negative Collection. http://digital.lib.ecu.edu/8837 Behind the Magic
    • 4. Introducing Digital Collections...
    • 5.  
    • 6.  
    • 7. Finding Aid Interface “ Ad hoc” Digitization Requests
    • 8. Digital Collections and Exhibits Digitized Materials not included in a collection or exhibit (ad-hoc digitization) Finding Aids to digital and analog collections User Repository
    • 9. System Needs UNIFIED FLEXIBLE MODULAR
    • 10. Masquerade party in China. ( undated ). James N. Joyner Papers. http://digital.lib.ecu.edu/1554 Our Adoring Public
    • 11. Typical Users
    • 12. Humanities Users
      • Users of archival materials
        • How do they search?
• “… humanities scholars and other researchers… rely heavily on browsing, collecting, rereading and notetaking. They tend to compile a wide variety of sources…assembling, organizing, reading, analyzing and writing.” – Palmer, et al., 2009
        • What do they expect?
          • Diverse primary resources
          • To be able to create their own context
          • Better pedagogical tools
          • In other words…access to primary sources and tools for deep reading and interpretation
    • 13.  
    • 14.  
    • 15.  
    • 16.  
    • 17. Typical Users
    • 18. Undergraduate Users
      • Users of Research Materials
        • How do they search?
          • General, thematic searches
          • “ Just keep clicking until I find it”
          • Familiar with the web, but perhaps not research resources
        • What do they expect?
          • Everything they need is online
          • All searches are like Google
          • What’s a finding aid?
    • 19.  
    • 20.  
    • 21. Organization/Guidance Now What?
        • Finding aids enmeshed with digital objects
      Broad Categorization
        • Faceted manipulation of results
    • 22. Data-driven discovery
        • Tag clouds (for serendipity and browsing, gauging the scope)
      Hyperlinks in records Full Text Search
    • 23. Personalization
    • 24. The improved Jordan grits separator . (1915).F. Rehm and Sons Company Records. http://digital.lib.ecu.edu/803 A Better Mousetrap
    • 25. REPOSITORY BASICS
      • Layer 1
    • 26. Repository Basics: Architecture Vs. familiar supported Already in use shareable More flexible? Starting over
    • 27. Image/audio/video files XML documents/ METS Index/TEXTML User search Search results Read Send Web application / ASP.NET
    • 28. Repository Basics: Metadata
      • Database built of METS records
        • dmdSec
          • MODS
          • DC for descriptive sections
          • TEI when transcriptions exist
          • Locally created sections for tags/comments
        • amdSec
          • MIX/AudioMD/VideoMD
          • Currently no preservation metadata other than what is already captured by MIX
        • fileSec
          • Placeholder for Master
          • Location of Access and Thumb surrogates
      • EAD schemas integrated separately
      XML documents/ METS
    • 29.  
    • 30. 37th U.S. Colored Infantry Regiment History (#MF0040) 5th North Carolina Infantry Regiment Collection (#874) A New and Correct Map of the Province of N.C. (#MC0035) Abernethy, Charles Laban, Jr., Papers (#98) Adams, Faye B., Oral History Interview (#OH0251) Agricultural Resources Center Pesticide Education Project Records (#905) Ainsworth, Walden Lee, Papers (#250) Albright Family Papers (#70) Albright, R. Mayne, Oral History Interview (#OH0036) Alder, Mavis M., Oral History Interview (#OH0217) Alford, Mike, Collection (#1094) Allen, Sarah, Papers (#471) Americae sive Indiae Occidentalis Tabula Generalis Map (#MC0023) American Legion Pitt County Post #39 Papers (#120) Amis-Clark-Puryear Papers (#474) Ammons, A. R., Papers (#1096) Anastasion, Steven N., Collection (#913)
    • 31. Collections
    • 32.  
    • 33.  
    • 34.  
    • 35.  
    • 36. DYNAMIC INTERACTION
      • Layer 2
    • 37.  
    • 38.  
    • 39.  
    • 40.  
• 41. 20,018 10,076 7,932 6,856 6,151 Slope: -1 Slope: -.81
    • 42.  
    • 43.  
    • 44.  
    • 45.  
    • 46.  
    • 47. Image/audio/video files METS/XML documents Index/TEXTML SQL database Read Send Write Admin form/add tag User search Search results & comment/tag cloud/ faceted results Web application / ASP.NET
    • 48. ADDITIONAL RESOURCES
      • Layer 3
    • 49. Additional Resources Eastern North Carolina Digital Library East Carolina Manuscript Collection Guides Joyner Library Catalog Research Databases LibGuides Library Website Internet
    • 50.  
    • 51. Image/audio/video files METS/XML documents Index/TEXTML SQL database Admin form/tag/comment User search Search results Web application / ASP.NET Local XML documents ENCDL Index/ TEXTML EAD/XML documents EAD Index/ TEXTML Image/audio/video files
    • 52.  
    • 53.  
    • 54. Presented award . (1964).The Daily Reflector Negative Collection. http://digital.lib.ecu.edu/6798 Evaluation Workers holding protest signs . (2007). Workers Vanguard no. 891. http://digital.lib.ecu.edu/843
    • 55.
      • Organization/Guidance
      • thematic collections
      • faceted refinement
      • better integration with finding aids
      • Data-driven Discovery
      • subject cloud
      • searching full text across multiple sources
      • hyperlinks in records
      • Personalization
      • comments and tags
      • personal collections
      • tools for reusing collections
    • 56.  
    • 57.  
    • 58. Anecdotal Evidence
      • Use in English 1200, History Classes
      • Use of A.R. Ammons collection by several scholars
      • Will add a collection specifically for the Interior Design program in the autumn.
      • Not “Mass Digitization” but “Mass Representation”
    • 59. Orville Wright Glider Flights - Cyanotype #3 . (1911).The Alpheus W. Drinkwater Collection. http://digital.lib.ecu.edu/1388
    • 60. Bibliography
      • Fallows, Deborah. Pew Internet and American Life Project: Search Engine Use . Available: http://www.pewinternet.org/pdfs/PIP_Search_Aug08.pdf (February 5, 2009).
      • Head, Alison J. “Information Literacy from the Trenches: How Do Humanities and Social Science Majors Conduct Academic Research?” College and Research Libraries . 2008; 69:5.
      • Palmer, Carole L., Lauren C. Teffeau and Carrie M. Pirmann. 2009. Scholarly Information Practices in the Online Environment: Themes from the Literature and Implications for Library Service Development . Report commissioned by OCLC Research. Published online at: www.oclc.org/programs/publications/reports/2009-02.pdf
      • Proffitt and Schaffner. 2008. The Impact of Digitizing Special Collections on Teaching and Scholarship: Reflections on a Symposium about Digitization and the Humanities. Report produced by OCLC Programs and Research. Published online at: www.oclc.org/programs/reports/2008-04.pdf
      • Sinclair, James and Michael Cardew-Hall. “The folksonomy tag cloud: when is it useful?” Journal of Information Science . 2008; 34; 15.
    • 61. Contact
      • Gretchen Gueguen
      • Digital Initiatives Librarian
      • J.Y. Joyner Library, East Carolina University
      • [email_address]
      http://personal.ecu.edu/guegueng/readings/JustKeepClicking.ppt