13. Project Requirements
Do not assume that any step will be problem free!
1. Every Resource record is linked to its MARC record
2. Every subfield 0 match is accurate
3. Verify that each match can be downloaded / imported
4. For each record pair, ensure that the headings match
15. Near-Match Issue
• Ford Madox Ford != his maternal grandfather
600 1 0 $a Ford, Ford Madox, $d 1873-1939 $0 http://id.loc.gov/authorities/names/n810502328
16. Solution
• Two parts:
1. Compare authorized name with name string
2. Check for multiple subfield 0s
17.
18.
19. Summary
• We are enhancing data in two local systems
• We want to connect to external systems
• We want our description to be recognized outside of our domain
• URIs are the first (not straightforward) step
• It’s not about links, but the potential for links
• Once connected, the network changes
20.
21.
22.
23.
24.
25. Code created; code shared
• MARC XML analysis:
https://github.com/fordmadox/xquery-scripts
• Authority download and ASpace Linking:
https://github.com/mark-cooper/authorizer
Editor's Notes
Good afternoon, everyone. As Karen mentioned, I will be going into a bit more detail about our efforts to enhance Yale’s legacy and current archival description by associating URIs with name and subject headings. To do that, I have decided to frame my talk around Fernando Pessoa.
There are a few different reasons why I have selected Fernando Pessoa, represented here in the Social Networks and Archival Context interface, but the most important reason is because of Pessoa’s proclivity for creating and often writing as a variety of heteronyms – over 70 throughout his lifetime.
Pessoa preferred the term heteronym to pseudonym, since, as he said himself, his heteronyms were “authors to whom he served as literary executor.” Many of the names you see listed here in SNAC, which can be accessed in the interface by clicking on a link labelled alternative names, have been described as different writers, different people, which also raises the question, as we undertake Linked Data projects today, of whether those different names require different URIs.
In SNAC, there is one URI for Pessoa. In the Library of Congress Name Authority File, however, some of his heteronyms have their own authority records. An authority record for Ricardo Reis was just created last year, for example, in LC’s database. And even though there is only one Wikipedia entry for Pessoa in the English-language Wikipedia…
…what you see on this next slide is an entry for Ricardo Reis in the Portuguese-language edition of Wikipedia.
There are also stand-alone URIs for some of Pessoa’s heteronyms in Wikidata.
In fact, if you view his entry in Wikidata, as seen here…
You will find a series of statements that are grouped under a heading that’s labeled “said to be the same as”. Here, you will see an entry for Ricardo Reis and others. Each of these entries is a URI. There is also a URI for the term “heteronym” in Wikidata, which is how all of these same-as relationships are characterized. Furthermore, there is a URI for the concept of “said to be the same as” in Wikidata. At this point, we are starting to go down the Linked Data rabbit hole. I don’t plan to do that in today’s talk, so instead I would like to pull things back for a moment and provide a concrete example of the value in adding URIs to archival description.
On this slide, I have included an image of the recently-released Digital Edition of Pessoa’s writings, which is a collaborative project undertaken by the New University of Lisbon and the University of Cologne. This site currently contains, among other writings, all of the poetry published by Pessoa in his lifetime.
Here, for example, is an encoded transcription of one of his poems, attributed to his birth name, alongside a digitized copy of that same version of the poem.
Now, as for the connection to the project that Karen and I are reporting on today, because Pessoa is one of the agent records in our ArchivesSpace database, we will be updating our record for him with a URI.
And here….
…is how that Agent record looks in the development version of our ArchivesSpace Public User Interface. The only reason we have an agent record for Pessoa is because the Beinecke Library acquired a draft of the same poem that I just showed you in the Digital Edition website, which is represented here on the screen by the single search result.
Of course, our local description would probably never go to the lengths of providing a researcher a link to an encoded version of the poem hosted elsewhere. But what happens when we add URIs to our description?
When we use URIs, we create potential connections. And these potentialities can be realized without us doing anything else. How? Well, as you see in this slide, imagine that the Beinecke has added a URI that points to the LC Name Authority File. Also imagine that the Digital Edition website has done the same, but instead of using an LC URI, they have used the authority ID from the National Library of Portugal. So now we have two IDs that aren’t the same.
But because of Linked Data services, such as Wikidata, potential connections exist. Anyone can connect these two IDs, at the time of need, because Wikidata records them both. In fact, in addition to the LC NAF ID and the National Library of Portugal ID, the Wikidata record for Pessoa references 48 other IDs, all of which refer to Pessoa, in different systems, different languages, all over the world. In other words, we can provide a solid foundation for connections without even having to communicate with other description providers.
So that’s why we’re adding URIs, and that’s also why I hope that everyone else is adding URIs, or considering adding them, to their archival description. But Fernando Pessoa and his heteronyms are also why adding URIs is not a straightforward process, let alone describing relationships amongst those URIs.
But all projects have to start somewhere, so that’s why I’ve started with the example of what it takes to enhance a single name string with a URI. When we started our project, I did not have a grasp on how many headings we would be updating, but now that we have gotten to this point, I can tell you that we have added exactly 31,665 unique URIs to nearly 10k finding aids. And along the way, we’ve made mistakes, so next I just want to talk a little bit about how we reviewed our work.
When it came to quality control, we had four general project requirements. First, we had the task of connecting each ArchivesSpace record with its corresponding MARC record. Simple, right? Well, we had a few issues here, such as the wrong links being made accidentally, as well as one issue where a single collection had so many access points that it was split, long ago, into two MARC records in our ILS, whereas we have a single Resource record for that collection in ArchivesSpace.
Next, we wanted to ensure that every subfield 0 that we added during the course of the project was accurate. Most of the subfield 0s were added automatically, by Backstage, and only when the name string was an exact match with the primary heading from an authority record. In all of those cases, we had to hope that the archivist or cataloger added the name string correctly in the first place. We also added a much smaller subset of URIs manually, when our group reviewed the “near match” reports provided by Backstage, as Karen already mentioned. And, in my experience, whenever you have more than one person doing more than one thing manually, you are going to get a variety of errors, so you have to check the results.
Third, and this was a simple one since we had LYRASIS do it for us, we had to make sure that for every subfield 0 we added, we were able to download its authority record from LC or the Getty. And that’s just a simple numbers check.
Finally, we have to verify that all of the headings that we have in our ILS are also in ArchivesSpace and linked to the exact same descriptive records. This is also basically a numbers check, but it is a bit more nuanced since ArchivesSpace does not align one-to-one with bibliographic description. For just one example, a Meeting Name in MARC is mapped as a Corporate Agent record in ArchivesSpace, making it indistinguishable from other Corporate Agent headings. Not a travesty by any means, but it makes our last stage of verification thornier than I would like.
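To make that last numbers check concrete, here is a minimal sketch in Python. It assumes the ILS headings and the ArchivesSpace agent links have already been exported as simple (bib id, value) pairs; the function name, the tag table, and the sample data are all hypothetical and are not taken from the project's own tooling. Family names and subject headings are left out to keep it short.

```python
from collections import Counter

# Assumed mapping of MARC name tags to ArchivesSpace agent types.
# Meeting names (X11) collapse into "corporate", as described above.
TAG_TO_AGENT_TYPE = {
    "100": "person", "600": "person", "700": "person",
    "110": "corporate", "610": "corporate", "710": "corporate",
    "111": "corporate", "611": "corporate", "711": "corporate",
}


def compare_counts(ils_headings, aspace_links):
    """Return {(bib_id, agent_type): (ils_count, aspace_count)} where counts differ."""
    ils = Counter(
        (bib_id, TAG_TO_AGENT_TYPE[tag])
        for bib_id, tag in ils_headings
        if tag in TAG_TO_AGENT_TYPE
    )
    aspace = Counter(aspace_links)
    return {
        key: (ils[key], aspace[key])
        for key in set(ils) | set(aspace)
        if ils[key] != aspace[key]
    }


# Hypothetical export: bib "b1" has a meeting name (611) and a corporate name (710)
# in the ILS, but only one corporate agent link in ArchivesSpace.
ils_headings = [("b1", "600"), ("b1", "611"), ("b1", "710"), ("b2", "600")]
aspace_links = [("b1", "person"), ("b1", "corporate"), ("b2", "person")]
differences = compare_counts(ils_headings, aspace_links)
```

Because meeting and corporate names share one bucket, a difference reported here only says that a corporate-type heading is missing, not which kind, which is the thorny part.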
The important takeaway, though, is that we had errors at every stage in the project that we needed to correct, so next, I am going to show 3 examples of errors encountered….
URIs are opaque, so everything looks fine here…
Wrong URI, Right Name (sort of, as FMF was named after FMB).
In this case, we caught the error since the MARC record eventually had two subfield 0s. The reason: we sent this record to Backstage twice during the course of the project, and when it came back the second time, it had two subfield 0s. The second was the correct one; the first one was for FMB. So, we removed the first one! We also checked all of our matches with the authorized headings to ensure that there weren’t any other blatantly wrong matches.
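Both parts of that check can be sketched in a few lines of Python. This is only an illustration, not the XQuery we actually ran: the example.org URIs, the `AUTHORIZED` lookup table, and the function names are hypothetical, and the normalization is deliberately crude.

```python
import xml.etree.ElementTree as ET

MARC = "http://www.loc.gov/MARC21/slim"
NS = {"m": MARC}

# Hypothetical record with two subfield 0s on one heading.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="600" ind1="1" ind2="0">
    <subfield code="a">Ford, Ford Madox,</subfield>
    <subfield code="d">1873-1939</subfield>
    <subfield code="0">http://example.org/authorities/names/first</subfield>
    <subfield code="0">http://example.org/authorities/names/second</subfield>
  </datafield>
</record>"""

# Hypothetical lookup of authorized headings, keyed by URI.
AUTHORIZED = {
    "http://example.org/authorities/names/first": "Brown, Ford Madox, 1821-1893",
    "http://example.org/authorities/names/second": "Ford, Ford Madox, 1873-1939",
}


def normalize(name):
    """Lowercase, drop commas and trailing periods, collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").rstrip(". ").split())


def review(record_xml, authorized_by_uri):
    """Return (tag, message) pairs for headings that need a human look."""
    problems = []
    root = ET.fromstring(record_xml)
    for field in root.iter("{%s}datafield" % MARC):
        tag = field.get("tag")
        uris = [(sf.text or "").strip()
                for sf in field.findall("m:subfield[@code='0']", NS)]
        heading = " ".join((sf.text or "") for sf in field.findall("m:subfield", NS)
                           if sf.get("code") != "0")
        if len(uris) > 1:
            problems.append((tag, "multiple subfield 0s"))
        for uri in uris:
            authorized = authorized_by_uri.get(uri)
            if authorized is not None and normalize(authorized) != normalize(heading):
                problems.append((tag, "name mismatch for " + uri))
    return problems


problems = review(SAMPLE, AUTHORIZED)
```

Here the first URI is flagged because its authorized heading differs from the name string, while the second passes. A real comparison would need to decide which subfields belong to the heading.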
Getty responded within hours to apply a fix so that we could download this record. We also had to report a handful of issues to LoC during this project, when we discovered that some records were available at authorities.loc.gov but not available (due to indexing issues?) at id.loc.gov.
Don’t use undifferentiated records. We had 135 matches to these types of records, and once we discovered that, we removed those subfield 0s (but not the headings) from our records. In this case, the record is for a G.B. who published a book in 2015 as well as a G.B. who was a 19th century musician.
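One way to spot these records, sketched here in Python, is to test the downloaded authority record's 008 field: in the MARC 21 authority format, position 32 is coded "b" for an undifferentiated personal name. Whether the matches were identified this way during the project is an assumption on my part, and the function name and sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"


def is_undifferentiated(authority_xml):
    """True if 008/32 of a MARCXML authority record is coded 'b'."""
    root = ET.fromstring(authority_xml)
    for control in root.iter(NS + "controlfield"):
        if control.get("tag") == "008":
            fixed = control.text or ""
            return len(fixed) > 32 and fixed[32] == "b"
    return False


# Hypothetical 40-character 008 values with only position 32 filled in.
undiff_008 = "|" * 32 + "b" + "|" * 7
diff_008 = "|" * 32 + "a" + "|" * 7
TEMPLATE = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<controlfield tag="008">{}</controlfield>'
    "</record>"
)

flag_undiff = is_undifferentiated(TEMPLATE.format(undiff_008))
flag_diff = is_undifferentiated(TEMPLATE.format(diff_008))
```

Any URI whose authority record is flagged would have its subfield 0 removed while the heading itself stays in place.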
All our finding aids, subject, and agent records (now represented as URIs) from this project, ingested as a graph with Gephi.
43,109 nodes: 51% agent headings, 26% subjects, 23% finding aids
137,844 connections / edges: 40% are 650 topical headings (orange lines), 22% are 600 and 700 headings (pink), and nearly 20% are geographic headings (green).
Oft-used subject heading in the Beinecke.
J.B., here in isolation, but she is / would be much, much more central in other graphs (even this graph, if we described relationships among people, in addition to material-to-people relationships). And the point here is that once we add URIs for these entities, we create that potential.
Like Baker, another very isolated agent record in our graph…
But the underlying metadata, seen here through Google’s eyes, provides one possibility for (re)connecting Pessoa outside of our “Archives at Yale” graph.
The first link includes a few scripts used to review our MARC XML records before sending them to LYRASIS (checking for typical issues we encountered, like a URI link in a subfield other than subfield 0, etc.)
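For a sense of what such a check looks like, here is a minimal Python sketch of the "URI in the wrong subfield" test. The shared scripts are XQuery, so this is a stand-in rather than their code; the list of heading tags, the function name, and the example.org URI are assumptions.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"

# Assumed set of name and subject heading fields to inspect.
# Fields such as 856 are skipped because URIs legitimately appear there.
HEADING_TAGS = {"100", "110", "111", "600", "610", "611",
                "630", "650", "651", "655", "700", "710", "711"}


def misplaced_uris(record_xml):
    """Return (tag, subfield code, text) for URIs outside subfield 0."""
    findings = []
    root = ET.fromstring(record_xml)
    for field in root.iter(NS + "datafield"):
        if field.get("tag") not in HEADING_TAGS:
            continue
        for sf in field.findall(NS + "subfield"):
            text = (sf.text or "").strip()
            has_uri = "http://" in text or "https://" in text
            if sf.get("code") != "0" and has_uri:
                findings.append((field.get("tag"), sf.get("code"), text))
    return findings


# Hypothetical record where the URI was keyed into subfield 2.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Poets.</subfield>
    <subfield code="2">http://example.org/authorities/subjects/poets</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="2">
    <subfield code="u">http://example.org/finding-aid</subfield>
  </datafield>
</record>"""

findings = misplaced_uris(SAMPLE)
```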
The second link is the amazing set of tools developed by Mark Cooper, at LYRASIS, which downloads authority records (from LC and Getty), gets them into ASpace, and links those authority records back to our Resource records (by means of our “bib ids” from our ILS).