Serendipity in Digital Collections: Enhancing Discovery with Linked Data Anna L. Creech, Head, Resource Acquisition and Delivery, Boatwright Memorial Library, University of Richmond
2. Linked Data in Libraries
• Bibliographic metadata links authors,
works, and subjects.
• Authority control ensures that those
links remain consistent over time.
• RDA and BIBFRAME seek to expand
bibliographic metadata in new
directions, leading to better linked data.
3. Overview of Web-Scale Discovery
• One central index of a library’s
collections, regardless of format.
• Capable of incorporating metadata from
the library’s digital collections.
• Links to full-text and audio/visual
content, but does not host this content.
4. Improving Discovery with
Linked Data
• Provide an alternative to OpenURL,
which often leads to dead ends due to
incomplete or incorrect information.
• Incorporate background or related
information in search results more
effectively.
• Provide more serendipity in online
searches.
5. Overview of Library Digital
Collections Platforms
• DSpace and CONTENTdm were great
tools when we started digitizing, but
they haven’t scaled up.
• New commercial systems and open
source tools offer the promise of better
metadata frameworks and linked data.
6. Enhancing Discovery of Library
Digital Collections
• Commercial products are closed
systems, but they could be opened more.
• Hydra DAMS support linked data.
• We need more open data sets to link to.
• Library digital collections need to be
created with linked data in mind.
7. Thank You
“Chainmail” by Ben Rogers, (CC BY-NC
2.0) https://flic.kr/p/nyctg5
Shana McDanold, Michael Giarlo,
Stephen Francouer, Chris Kemp, Tom
Campagnoli, Catherine Clements, and
Leigh McDonald were very helpful in
putting this presentation together.
This talk will discuss what we are doing and what we could be doing to make our collections more discoverable through linked data.
Bibliographic metadata began as an analog attempt at linked data, and our current systems have enhanced that well beyond the capabilities of the card catalog. Most library catalogs will link an author’s works and subject headings, at the very least, and some will also connect other versions of the same work.
The main reason why these links work is that we have established some level of authority control over the metadata. This means that anyone, anywhere, at any time will be creating metadata using the same guidelines. There are, of course, local variations, but they tend to fall within acceptable parameters. As a result, metadata connections within a library collection remain consistent over time, and generally are compatible across collections.
Full disclosure: My years as a serials cataloger were short and more than a decade ago, so I have relied heavily on my cataloging colleagues to keep me informed of the broad changes that are happening in the realm of bibliographic metadata. But, even back then, I was hearing lots of discussions about FRBR -- Functional Requirements for Bibliographic Records. FRBR is the conceptual model for creating metadata that can be sliced and diced in ways beyond what MARC alone is capable of.
RDA -- Resource Description and Access -- is the content standard developed under the FRBR model. It is container agnostic: the rules of RDA can be used to create bibliographic metadata in containers such as MARC or Dublin Core, depending on what it is that you are trying to describe. This expands the kinds of things that can be included in the library catalog, provided the library catalog system can index more than just MARC.
BIBFRAME – Bibliographic Framework – is designed to supplement MARC standards and use linked data that can be applied both inside and outside of the library community. BIBFRAME won’t replace MARC entirely, but it will allow us to stop stuffing things into MARC that weren’t intended to be indexed that way, and stop asking MARC to render information in ways it wasn’t designed to do. BIBFRAME focuses on identifiers, making it much more useful in the semantic web.
Currently, BIBFRAME is in the most nascent of nascent stages. It’s being spearheaded by the Library of Congress, and while it’s theoretically compatible with FRBR and RDA, there are some key differences. BIBFRAME contains four main classes, which are not entirely analogous to FRBR: Creative works, Instances, Authorities, and Annotations. It’s also quite monograph-oriented at this stage, and although it includes a serials entity, there are many kinks still to be worked out. Serials are never easy to deal with, so this is no surprise to me.
BIBFRAME takes the MARC record, separates it into its components, regroups those components, and then allows semantic web architecture to recombine the components for the user in whatever way makes sense, including remixing and mashing up with other datasets.
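A toy illustration of that regrouping might look like the following. The field names and the example record are invented for illustration; this is not a real MARC-to-BIBFRAME mapping, just a sketch of the separate-then-regroup idea.

```python
# Hypothetical sketch: splitting a flat MARC-like record into
# BIBFRAME-style components (Work, Instance, Annotation).
# Field names here are illustrative, not a real MARC mapping.

def to_bibframe(marc_record):
    """Regroup a flat record dict into linked components."""
    work = {
        "title": marc_record["245_title"],
        "creator": marc_record["100_author"],
        "subjects": marc_record.get("650_subjects", []),
    }
    instance = {
        "instanceOf": work["title"],            # link back to the Work
        "publisher": marc_record.get("260_publisher"),
        "isbn": marc_record.get("020_isbn"),
    }
    annotation = {
        "annotates": instance.get("isbn"),      # local holdings, reviews, etc.
        "holdings": marc_record.get("852_holdings", []),
    }
    return {"work": work, "instance": instance, "annotation": annotation}

record = {
    "245_title": "Moby-Dick",
    "100_author": "Melville, Herman",
    "020_isbn": "9780142437247",
    "852_holdings": ["Main Stacks PS2384 .M6"],
}
components = to_bibframe(record)
print(components["work"]["creator"])   # Melville, Herman
```

Once the components are separate, any of them can be recombined or mashed up with outside datasets without dragging the whole flat record along.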
It’s the annotation class of BIBFRAME that makes it exciting for me. The annotation is where local library holdings would be noted, as well as related resources, cover art, reviews, summaries, table of contents, and potentially any other notes or linked data we were unable to include in MARC.
Web-scale discovery systems are several years old now, and most of us are familiar with them, so I won’t spend too much time on this. While each one has unique features that distinguish it from competing products, they all share characteristics that distinguish them from previous attempts at such an endeavor.
Unlike federated searching, which attempted to pull together all of the library’s indexes into one central search, these are pre-indexed collections of metadata drawing upon the library’s catalog records and metadata sources comparable to the library’s bibliographic indexes and full-text online collections. In some instances, this central index includes locally created digital collections and institutional repositories. De-duplication of citations across all of these sources is attempted and mostly successful. Though the search results are generally not as fresh as federated search results, they are retrieved faster and with fewer duplicates.
The discovery systems attempt to enhance the metadata retrieved through methods such as grouping like formats, suggesting other resources for more refined searches, providing controlled vocabulary suggestions, and giving brief introductions to topics by pulling in encyclopedia entries alongside the search results. While some attempts are made at providing direct links to the content indexed, most discovery systems end up relying on OpenURL at some point.
OpenURL is a standard developed in the late 90s, and it uses bibliographic metadata from the source (usually a citation) to link to a target identified by a localized knowledgebase of library holdings. The target could be the full-text of an online resource, or streaming multimedia, or a bibliographic record of a print resource in the library’s physical collection. NISO’s KBART working group and IOTA project are attempting to improve knowledgebases and OpenURL tools. OpenURL links are still prone to errors on both the source and target ends, highlighted even more so by these web-scale discovery systems making more content visible to users.
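A link resolver assembles these links from citation metadata. As a rough sketch of what one looks like (the resolver address, citation, and ISSN below are all made up for illustration), an OpenURL 1.0 query in key/encoded-value (KEV) format is just URL-encoded citation fields appended to the resolver's base address:

```python
from urllib.parse import urlencode

def build_openurl(resolver_base, citation):
    """Build an OpenURL 1.0 (KEV) link from citation metadata.

    resolver_base is the library's link resolver; the address used
    below is a made-up example, not a real service.
    """
    params = {
        "url_ver": "Z39.88-2004",                         # OpenURL 1.0 version
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",    # journal-article format
        "rft.atitle": citation["article_title"],
        "rft.jtitle": citation["journal_title"],
        "rft.volume": citation["volume"],
        "rft.spage": citation["start_page"],
        "rft.date": citation["year"],
        "rft.issn": citation["issn"],
    }
    return resolver_base + "?" + urlencode(params)

link = build_openurl(
    "https://resolver.example.edu/openurl",   # hypothetical resolver
    {
        "article_title": "Serendipity in Digital Collections",
        "journal_title": "An Example Journal",
        "volume": "14",
        "start_page": "1",
        "year": "2014",
        "issn": "1234-5678",                  # placeholder ISSN
    },
)
print(link)
```

The resolver on the receiving end then matches those citation fields against its knowledgebase of local holdings, which is exactly where the incomplete or incorrect metadata mentioned above causes dead ends.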
In doing some research for this presentation, I found myself relying heavily on Google Scholar, since our collection of library science publications is not as robust as that of an institution that focuses on that kind of scholarship. If I used the OpenURL links, I was more often than not sent to a dead end of content we do not own. However, Google Scholar’s links would frequently send me to the pre- or post-print deposit of the article in question in the author’s institutional repository, a service our OpenURL and Discovery provider is unable to offer at this time. If they could, I suspect that some of the frustrations of our users would be reduced.
Linked data can enrich the user experience by providing context for the sources they are finding. Summon 2.0, for example, pulls in encyclopedia entries for key words from the search terms when possible and displays them on the sidebar to provide background information. If the hover-over that now displays only the citation data in the sidebar would also include information about the authors of the work such as their affiliations, other works, ORCID, and bio, the researcher would have vastly greater information to place this citation in context. This could start by making use of the VIAF, which “matches and links the authority files of national libraries and groups all authority records for a given entity into a merged ‘super’ authority record that brings together the different names for that entity.”1 With an established authority record, it should be simple to ensure that the data pulled from other sources is accurately reflecting this author and not another one with a similar name.
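As a hypothetical sketch of that last step, matching a candidate author name against the name forms gathered in a merged authority record could be as simple as the following. The cluster record shape, names, and identifier here are all invented for illustration; this is not the real VIAF API, only the matching idea.

```python
def normalize_name(name):
    """Crude normalization: lowercase, keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def match_authority(candidate, cluster):
    """Return a VIAF-style URI if the candidate matches any name form
    in a simplified, hypothetical merged authority record."""
    forms = {normalize_name(n) for n in cluster["name_forms"]}
    if normalize_name(candidate) in forms:
        return "http://viaf.org/viaf/%s" % cluster["viaf_id"]
    return None

# A toy cluster: different national-library forms of one author's name.
cluster = {
    "viaf_id": "123456",          # placeholder identifier
    "name_forms": ["Melville, Herman", "Melville, Herman, 1819-1891"],
}
print(match_authority("melville herman", cluster))
```

With the authority URI in hand, affiliations, other works, and biographical data pulled from other sources can be anchored to the right person rather than to anyone sharing a similar name.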
Similarly, linked data could be used to connect key terms to established thesauri, and provide researchers with suggested search terms or search results that they may not have considered, which is especially useful for novice researchers and those working outside their fields of study.
Linked data could also draw upon altmetrics and the suggestion tools found in online retail markets. For example, a citation for an article could not only provide links to articles cited by or citing, but also to popular articles downloaded or viewed by readers of the same article.
Ultimately, we need to move beyond a search results page with simple listings of items sorted by relevance or date to displays that put each item in some sort of context within the broader body of work that inspired and inspires it. Google is already doing some of this with the Knowledge Graph (https://www.google.com/insidesearch/features/search/knowledge.html). Library discovery systems need to keep up, or else they will be left behind as more and more researchers turn to Google and Google Scholar instead of these very expensive search engines.
1. "VIAF." VIAF. Accessed December 02, 2014. http://www.oclc.org/viaf.en.html.
In the beginning there was DSpace and CONTENTdm, both of which are still alive and kicking, but they are slowly being replaced by… what exactly is to be determined, but there are some options out there. Libraries are beginning to develop digital asset management (DAM) plans rather than haphazardly digitizing content and storing it on servers seemingly at random, with or without associated metadata. Along with those plans, they are either using existing commercial platforms or developing their own with a combination of open source tools.
If you are looking for a good review of existing digital asset management systems, I suggest reading the final report of the University of Utah's DAM Review Task Force (http://www.mwdl.org/events/DAMS_options.php). They assessed the options, created a list of criteria, and then rated each product. Then they did a SWOT analysis of the top open source option versus the top commercial option, based on their ratings. Of course, the needs of your institution may be different, but this at least provides a roadmap towards developing a digital asset management plan and system.
The UC San Diego Library Digital Asset Management System (DAMS) is a thoroughly fleshed out example of a locally developed system (http://libraries.ucsd.edu/about/digital-library/dams.html). The open source tools they are using allow them to ingest all current file format types and associate appropriate metadata with each. The front-end of the system uses Hydra and Blacklight for discovery.
For more of a turnkey solution that does not require a programmer, libraries are using commercial products like Digital Commons and ArtStor Shared Shelf. Like the open source and homegrown tools, these products can handle all current file format types, but the local customization is limited.
Regardless of what digital asset management system you choose to use, the important thing is the data that is going into it. Have a plan for how you are going to handle things like authority for names and places, and what to do when there isn’t an authority file for them. Have a plan for how you are going to preserve and move the data as systems and needs change over time. You may think that the best thing to do is to just start digitizing content, but you’ll regret that move when you have to go back later to reconstruct the whole process in order to incorporate appropriate metadata and links.
Currently, library digital collections seem to be developed outside of the context of the rest of the library’s collections, with separate metadata standards and storage systems. In order for us to move towards better discovery of these resources, we need to be able to create and associate appropriate metadata with each file, and then include that metadata in whatever discovery system or library catalog we are using. We also need for these collections to be visible to indexing tools like Google Scholar, using linked data schemas that are broadly recognized.
Hydra supports linked data, and will soon be using Fedora 4, which natively supports RDF (Resource Description Framework). This means that it can incorporate information from existing open data sets like GeoNames. If you point an object’s “spatial” property at a GeoNames URI, when you fetch that property it will auto-populate with the map attached to that URI.
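What that linking looks like on the wire is just a triple pointing at the GeoNames URI. In the minimal sketch below, the object URI and GeoNames id are placeholders, the `dcterms:spatial` property stands in for whatever spatial predicate a given repository uses, and the Turtle is written out by hand rather than with an RDF library so the shape stays visible:

```python
# Minimal sketch: expressing a digital object's spatial coverage as a
# link to a GeoNames URI, serialized as a Turtle triple by hand.
# Both URIs below are placeholders, not real records.

GEONAMES_PLACE = "http://sws.geonames.org/0000000/"   # placeholder GeoNames id

def spatial_triple(object_uri, geonames_uri):
    """Return a Turtle triple linking a digital object to a place URI."""
    return ("<%s> <http://purl.org/dc/terms/spatial> <%s> ."
            % (object_uri, geonames_uri))

triple = spatial_triple(
    "https://example.edu/collections/doc42",          # hypothetical object
    GEONAMES_PLACE,
)
print(triple)
```

The point is that the collection stores the URI rather than a local place string, so anything GeoNames knows about that place comes along for free.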
There are quite a few existing open data sets like GeoNames. DBpedia, for example, is a dataset of concepts extracted from Wikipedia and includes 11 different languages. VIAF, which I mentioned earlier, is another. However, the number of these datasets is small, and their content focus is often quite narrow. My hope is that as library digital collections make use of linked data to enhance their content, they too will be used as datasets to enhance content elsewhere, so as we build these collections, we should keep this in mind.
At my institution, we’ve been working on digitizing the papers of David Nelson Sutton, a prosecutor during the Tokyo War Crimes Trial. The papers themselves have now all been scanned and OCRed, and we are in the midst of the more challenging task of enriching the metadata and the documents themselves. Names and places are linked to authority files wherever possible, which is particularly important because the spellings of Anglicized names have changed over time, and locations in some disputed territories are referred to in both Japanese and Russian, often under names that differ from the ones used today. When this project is completed, the thesaurus and indexed data will be available to be linked to by other collections from other libraries. We know that we are not the only institution building digital collections around documents from those trials, so this will be an important piece of connecting all of the documents for researchers.
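The kind of variant-name resolution described above can be sketched as a simple inverted lookup. The authority URI and the list of name forms below are invented for illustration (Port Arthur is one real-world example of a place known under Japanese, Russian, and Chinese names in that period):

```python
# Illustrative sketch: resolving variant historical or transliterated
# place names to a single authority URI. The URI is a placeholder.

AUTHORITY = {
    "http://example.edu/auth/place/0001": [
        "Port Arthur", "Ryojun", "Lüshunkou",
    ],
}

# Invert into a lookup from variant spelling to authority URI.
VARIANT_TO_URI = {
    variant.lower(): uri
    for uri, variants in AUTHORITY.items()
    for variant in variants
}

def resolve_place(name):
    """Return the authority URI for a variant spelling, or None."""
    return VARIANT_TO_URI.get(name.lower())

print(resolve_place("Ryojun"))
```

Because every variant resolves to one URI, another institution's collection can link its own documents to the same place without agreeing on a preferred spelling first.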
Libraries have a long history of describing things with controlled vocabulary and standards, and with long-term discoverability in mind. We are uniquely positioned to make linked data and the semantic web a reality for scholars and researchers everywhere. However, we are generally lacking in three essential pieces to translate our knowledge and experience to the actual implementation: money, time, and programming skills. With the right set of tools and institutional support, we could do anything, but unfortunately most of us are not in that position.
We are also still firmly in the world of closed system metadata and tools. Even though RDA is out there and being used to create records, we’re still cataloging with MARC and our public interfaces are rendering those records in pretty much the same way they’ve rendered them for decades. BIBFRAME could change all of that, but in the meantime we have things to catalog today that must be used with today’s tools. The world of bibliographic metadata is shifting, but at a snail’s pace. Or a cataloger’s. There is both a strong desire to “fix” bibliographic metadata and a strong fear of change running through all of the discussions related to this.
The semantic web is happening whether libraries are there or not. As the institutions of historical knowledge, it is imperative that we be involved in its development as well as its preservation.
Contact information:
Anna Creech
acreech@richmond.edu
@annacreech on Twitter