Piloting linked data to connect library and
archive resources to the new world of data, and
staff to new skills
CNI Fall Meeting, December 11, 2012
Laura Akerman, Metadata Librarian, Robert W. Woodruff Library, Emory University
Zheng (John) Wang, AUL, Digital Access, Resources, University of Notre Dame
Who has presented most frequently at CNI?
Current Model: Search and Discover
Metadata Published as Documents
Requires a Human to Decipher
Linked Data Model: Find
Semantic Graph Model
Machine Understands Semantics
Relevant to What We Do
Reuse, Authority Control, Knowledge Linking
To Interlink EAD, Catalog, and Other External Data
Little Time to Learn Additional New Things
• Learning group – open to all
2 "classes" a month, 5 months.
• Pilot: 3 months
Brainstorming a pilot project
Team: programmer, subject liaison, metadata
specialists, archivist, digital curator, fellow.
1-3 hrs/week for all but leader
[Diagram of brainstormed pilot ideas:]
• A sandbox running Linux
• Our own triplestore
• Integrate linked data into discovery (SPARQL)
• RDF from EAD
• RDF from TEI
• RDF from MARCXML
• Civil War rosters (and MARC)
• Data from other archives
• Navigation
• Faculty data
• Redesign metadata creation as RDF
3 months later...
Sampling little bites of the data:
• EAD (starting from the ArchivesHub stylesheet)
• MARCXML (starting from the LC stylesheet)
• id.loc.gov URIs for LC subjects and names
Visualization – triplestore
A few of the connections...
[Graph visualization: our HTTP URI for Tom Mobley, linked to the 48th Georgia Infantry (Confederate States of America. Army.)]
Selecting material that will “link up”
without SPARQL is too hard!
Even when items are in a unified “discovery
layer”, the types of search are limited.
Get it into triples, then find out!
•(No one model to follow has emerged. We
have to think about this ourselves.)
There are many ways of modeling data
ArchivesHub handles subjects:
<associatedWith><!--About the Concept (Person)-->
  <skos:Concept>
    <rdfs:label xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xml:lang="en">Geary, John White, 1819-1873</rdfs:label>
    <rdfs:label xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xml:lang="en">lcnaf</rdfs:label>
    <foaf:focus xmlns:foaf="http://xmlns.com/foaf/0.1/"><!--About the Person-->
      <foaf:Person>
        <rdfs:label xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xml:lang="en">Geary, John White, 1819-1873</rdfs:label>
      </foaf:Person>
    </foaf:focus>
  </skos:Concept>
</associatedWith>
LC's MARCXML to RDF/Dublin Core:
dc:subject "Geary, John White, 1819-1873"
Simile MARC to MODS to RDF:
<modsrdf:fullName>Geary, John White </modsrdf:fullName>
Linked data is HUGE
It’s coming at us FAST
It’s not “cooked” yet
• We learned more by doing than by "class".
• Making DBPedia mappings or links by hand is
very time consuming! We need better tools.
• We need to spend a lot more time learning
about OWL, and linked data modeling.
• Easily available tools are not ideal!
• Skills we needed more of: HTML5, CSS,
• Visualization/killer app not there yet.
• Can't do things without the data! No timeline
if no dates!
What we got out of it
Test triplestore for training and more
Better ideas on what to pilot next
Convinced some doubters
"Gut knowledge" about triples, SPARQL, scale
Beginning to realize how this can be so much
more than a better way to provide "search"
Outside our reach for now
Transform ILS system to use triple store instead of
Create hub of all data our researchers might want
Make a bank of shared transformations for EAD,
Shared vocabulary mappings
Social/networking aspect (e.g. Vivo, OpenSocial...)
- need a culture shift?
Build user navigation?
More Civil War triples including other local
Integrate ILS with DBPedia links?
Suite of “portal tools” for scholars?
Use linked data for crowdsourcing metadata?
Connect with others at Emory around linked data
• Focus on unique digital content
• Publish unique triples
• Reuse existing linked data
• Create standards or best practices
• Grow our skills
• Test and evaluate tools
• Develop tools
• Interdisciplinary linking?
• Metadata librarians - Linking association and
Connections group sponsors: Lars Meyer, John
Connections Pilot team: Laura Akerman (leader), Tim
Bryson, Kim Durante, Kyle Fenton, Bernardo Gomez,
Elizabeth Roke, John Wang
Fellows who joined us: Jong Hwan Lee, Bethany Nash
Laura Akerman, email@example.com
John Wang, Zheng.Wang.firstname.lastname@example.org
"Connections: Piloting linked data to connect library and archive resources to the new world of data, and staff to new skills". Emory University Libraries (EUL) will share their first-hand experience in interlinking Civil War-related materials and other online resources, leveraging Linked Open Data principles. As library linked data are emerging, EUL planned to evaluate linked data's potential to replace current library processes and services (bibliographic services, finding aids, cataloging, and metadata work) as a more efficient and sustainable means, and to understand its payoff for end-user research and learning. We initially focused on workforce education and hands-on learning through real-time experiments. Over the last year, the Connections project has begun to educate staff and prepare them to work with linked data, culminating in a 3-month hands-on pilot to build and convert some data. The pilot introduced the concept to a wide range of staff, including subject liaisons, archivists, metadata librarians, and programmers. We interlinked our "silos" of data with other open data sources, on a very limited scale, as a road to enhancing user discovery and use of our material. From our experience (insights, as well as some limitations and stumbling blocks), the group developed better-informed recommendations for further education and staff involvement, as well as potential incorporation of this technology into library services and workflows. Our experience will be particularly helpful for institutions that are awake to linked data's transformative potential and are making plans. We will share our assessment of the readiness of the entire linked open data ecosystem for libraries to cross-link disciplines, and the possible roles of libraries in a linked world. Beyond that, we plan to suggest some possible routes to help peers involve their staff with this new paradigm of information curation and dissemination.
• Can linked data replace library bibliographic services?
• How to start such an initiative
• Readiness
• Recommendations: Education; Integration; Potential areas of work/roles of librarians and staff
Reduce Duplicative Work (downloading, editing, creating holdings records)
Shorten Process Time (knowledge linking)
Enhance Authority Control (this John Wang is from Emory)
Give the Library Universal Attention (web scale)
Help Libraries Achieve Missions
• More efficient and sustainable than current biblio service tool sets
• Don't treat it as an additional thing
• Connect to the larger ecosystem
• Convert/make some RDF; show value
• Finished ILS migration
• Set up cloud services
• Set up biblio vendor services
• How we got two division directors to sponsor the initiative
Now I'm going to talk a little about our experience, and some of the discoveries we made. What we've produced so far isn't that significant, compared to what some other institutions have done... but maybe this is one way to think about involving staff in learning and preparing for using linked data in libraries.
We started having classes toward the end of last year. As John explained, our library had a lot on its plate, and the people whom we wanted to involve are very busy. So we usually had brown bag lunches, every other week. This was a high-level overview... sort of “ABC’s” This is a triple, this is how SPARQL queries work, here’s what OWL can be used for, here are some things about publishing linked data that we need to think about. I worked with a graduate fellow to team-teach these. We had a core group that was asked to attend every time, but a lot of folks from across the library attended. By the time we got to brainstorming about a “pilot project” for the summer, the group had a lot of ideas...
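The "ABC's" from those brown-bag sessions can be made concrete with a one-triple example. This is a generic sketch, not our actual data or vocabulary choices; the item URI is made up:

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .

# One triple: subject (an item URI), predicate, object (a literal)
<http://example.org/item/123>
    dc:subject "United States--History--Civil War, 1861-1865" .
```

And a SPARQL query is just a triple pattern with variables in some of the positions:

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# Every item carrying this subject heading
SELECT ?item
WHERE { ?item dc:subject "United States--History--Civil War, 1861-1865" . }
```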
And, we decided to try as many as we could with our pilot.
We chose to center our pilot around a small topic, the American Civil War, because we had some interesting resources and it’s the 150th anniversary... We wanted to show how linked data could link up our metadata “silos”, enhance our unique content, and integrate it with data from other places (manually and automatically). Of course, that would include DBPedia but also other archives, other sources of data specific to our theme, and perhaps even data we converted from other formats. We planned to have cool visualizations like maps and timelines. Maybe our data could contribute to a faculty project on the Battle of Atlanta. Oh, and also, we were going to build an interface to create metadata as “native RDF”. We would choose which data models we wanted to use. And, we would investigate which free or open source tools were most useful for doing this work. All in 3 months... using enthusiastic but busy people who were not the “A team” (our actual developers). Oh, and we were only going to work on a small sample of our metadata – chosen carefully ahead of time, to “connect up well”...
So, this was very ambitious, but linked data sounded so simple! We were only able to accomplish a fraction of all we’d planned in the 3 months... We investigated Virtuoso and Sesame, and also Callimachus, a new “beta” web framework; we decided to go with Sesame, since it had a web client people could use to run SPARQL queries and load things. By the way, our programmer Bernardo Gomez converted a copy of our ILS database to RDF (mostly Dublin Core) – this wasn’t part of the pilot per se, but an interesting exploration. We transformed a small number of finding aids using the ArchivesHub stylesheet as a starting point (lots of modifications still needed!). We transformed a subset of MARCXML for some digitized books, using the LC stylesheet as a starting point (experimenting a little bit with RDA vocabularies but not getting too far into it). We made some N3 triples by hand, in Notepad, to describe images with no metadata, and included id.loc.gov links and DBPedia links. We retrieved id.loc.gov name/subject URIs via script. Our programmer looked at scripting some links to DBPedia based on our names and subjects but found this too involved to attempt for this pilot. Building a navigation interface turned out to be a bit too much to accomplish in 3 months, but we had some adventures along the way. At the end, a power outage corrupted our Sesame/OWLIM triplestore... so no live demo today, but we can rebuild it. We investigated lots of software (mostly free and open source) for display, navigation, and publishing of linked data. Some of us were interested in using Drupal 7’s linked data capabilities to create a user interface, but we’re not sure this was the application we wanted. We had not planned to publish our data for this pilot! But we came to realize that if we wanted web-based tools such as LinkSailor to navigate our data, we’d have to publish it. We had fun with a few simple visualization tools that we could plug some of the data into directly.
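The hand-made N3 mentioned above looked roughly like this. The local image URI and title are hypothetical, and the LCSH identifier is a placeholder; only the DBPedia resource URI is a real identifier of the kind we linked to:

```turtle
@prefix dc:  <http://purl.org/dc/elements/1.1/> .
@prefix dct: <http://purl.org/dc/terms/> .

# A hypothetical local URI for a digitized image with no existing metadata
<http://example.emory.edu/image/cw-0001>
    dc:title "Unidentified soldier portrait" ;
    dct:subject <http://id.loc.gov/authorities/subjects/shXXXXXXX> ;  # placeholder LCSH URI
    dct:subject <http://dbpedia.org/resource/American_Civil_War> .
```

Even a few triples like these, typed in Notepad, were enough to load into Sesame and query alongside the converted EAD and MARCXML.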
This sequence illustrates the kind of connections we want to be able to make. I’m using sort of generic terms for the predicates, rather than any particular vocabulary, for simplicity. We go from a name/text string as a subject (in this case a person)
To a URI identifier which we came up with. With the ArchivesHub model we were able to see lots of “coined” URIs for entities such as names associated with an archive. We began to see some wisdom in having our own URIs which we can make assertions about (like, our URI identifies the same person as one from id.loc.gov) without having to assert things about other people’s data... But in this case, we don’t think anyone else has made an identifier for Mr. Mobley, so we would have to. The number of URIs we would need to mint was kind of astonishing for us; one of the things we learned is that we need a strategy for this.
We can then assert that he’s a member of a particular Civil War regiment. We have “NACO Authority” strings...
From here we could link to a regimental history in our collection. And, if our URI for the regiment was linked to a DBPedia entity, we could link to whatever information Wikipedia has on it and navigate to other regiments and much more. And who knows what other data might link to the DBPedia entity?
Or, a user could explore other material in that MARBL manuscript collection, or in any other collection that had material on that regiment, or the Civil War...
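In Turtle, the chain sketched over the last few slides might look like this. Every URI and predicate below is illustrative, standing in for whatever minting scheme and vocabulary we would finally choose; the DBPedia resource name is also hypothetical:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# A locally minted URI for Tom Mobley, since no one else has coined one
<http://example.emory.edu/person/mobley-tom>
    rdfs:label "Mobley, Tom" ;
    <http://example.emory.edu/vocab/memberOf>
        <http://example.emory.edu/unit/georgia-infantry-48th> .

# Our URI for the regiment, carrying the NACO-style heading and a DBPedia link
<http://example.emory.edu/unit/georgia-infantry-48th>
    rdfs:label "Confederate States of America. Army. Georgia Infantry Regiment, 48th" ;
    owl:sameAs <http://dbpedia.org/resource/48th_Georgia_Volunteer_Infantry> .
```

Once the regiment is linked out, anything that links to the same DBPedia entity becomes navigable from our collection, and vice versa.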
So, what we learned on our summer non-vacation... We spent too much time trying to select specific records to convert for our pilot. In the end, we loaded all our regimental histories and a subset of our finding aids, and a SPARQL query told us which ones had common subject headings. SPARQL is a skill that I think many librarians could start learning, by the way. There are plenty of SPARQL endpoints...
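The shared-heading query was along these lines, a sketch that assumes the subjects were mapped to dc:subject (the exact predicates varied across our converted data):

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# Which subject headings are shared by two different records?
SELECT ?subject ?itemA ?itemB
WHERE {
  ?itemA dc:subject ?subject .
  ?itemB dc:subject ?subject .
  FILTER (?itemA != ?itemB)
}
ORDER BY ?subject
```

Letting the triplestore find the overlap was far faster than eyeballing records ahead of time.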
When we contrast ArchivesHub's "associatedWith" construction to express concepts - in this case, a person - with archives that have material about them,
With the very simple mapping to Dublin Core,
To Simile's MARC to MODS to RDF approach... And I haven't had time to play with the conversion the BIBFRAME project has come up with, yet, or you'd get a slide of that! You can see that there are a lot of choices. What we wondered was, were there enough similarities in the relationships that we could find some common models and vocabularies across our data? That would make querying easier...
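Without common models, even a simple "find everything about this person" query has to enumerate each construction. A hedged sketch over the three mappings above; the modsrdf namespace here is an assumption, and the exact property paths depend on each stylesheet's output:

```sparql
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX modsrdf: <http://www.loc.gov/mods/rdf/v1#>  # assumed namespace

# Match the same person expressed three different ways
SELECT ?record
WHERE {
  { ?record ?p ?concept .
    ?concept rdfs:label ?name . }       # ArchivesHub-style: via a skos:Concept
  UNION
  { ?record dc:subject ?name . }        # LC MARCXML-to-DC: a plain literal
  UNION
  { ?record ?q ?n .
    ?n modsrdf:fullName ?name . }       # Simile MODS RDF: a structured name
  FILTER regex(str(?name), "Geary, John White")
}
```

Every extra model multiplies the UNION branches, which is exactly why we wondered about common models and vocabularies across our data.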
From my own perspective, it can be a bit overwhelming to follow linked data just now... so much to learn, so much happening. But I think you just have to dive in. This has been a major focus for me this year; I've followed many email lists and tried to keep up with projects here and in Europe, and I'm starting to feel a sense of urgency about this - we need to be on board. At the same time, there are aspects of the "how" that are still unclear and difficult - provenance, tools, and vocabulary mappings are just a few.
So we learned a lot more in the pilot than in class. People were more engaged because they were doing an assignment they came up with. Those of us who worked with DBPedia could see real possibilities, both as a means to link to Wikipedia content and as a vocabulary in itself. String matching LC subjects didn't work very well - this needs to be a larger project (maybe it's already happening?), using algorithms but also, we think, some human review. By larger, I mean a community effort. Who's doing it?
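String matching of the sort we tried can be sketched as a query against the DBPedia SPARQL endpoint, here an exact English-label lookup. Real matching needs the inverted "Lastname, Firstname, dates" heading normalized to natural order first, which is where the algorithms and human review come in:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Look up a DBPedia resource whose label matches a pre-normalized heading
SELECT ?resource
WHERE {
  ?resource rdfs:label "John White Geary"@en .
}
```

Exact-label matches like this are the easy cases; ambiguous names and headings with no label counterpart are what made hand review unavoidable.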
We had challenges finding and using tools (especially the non-programmers); how do we find what’s new, what’s good, what do we need to build ourselves? I had been warned that there weren’t really good tools for non-programmers out there... but it was interesting how many new tools appeared in beta just in the 3 months we were doing the pilot. We just ran out of time to try them all. The pilot made some of us painfully aware that our web skills were out of date. Most of the group expressed regret that they didn't have more time to get involved in this project, but felt they got a lot out of it anyway. In our discussions we recognized a conflict between the desire to create more of our metadata as data, to provide more hooks, and the reality that we have limited staff working at capacity... so we talked about crowdsourcing. We also need to explore how this would change the tools we are using to create metadata. Is it possible to make it easier to make more links?
One of our members said at one point, "this is really like a relational database, just not with tables" and from his perspective there's some truth to that, but we are starting to see that we can do way more with this than replicate what we're doing with relational databases, MARC, and XML. Linked data is not just all about “search”. We can make discoveries about our collections as a whole, but we can also link our content to the "things" it relates to and really weave them into the research environment for scholars, such as the articles appearing in UniProt and other scientific databases... As we look back, although we don't have a killer app yet, we've gotten a lot out of the last 3 months. We have our test triplestore and can begin to expand it bit by bit towards realizing some of our grand schemes... but we also have other ideas.
We also get a sense of our limitations. Some developments really call for big communities, maybe global effort. Who is going to host banks of shared transformations and vocabulary mappings? Some of us are interested in the social tools but our library isn't ready for that right now, however we can begin to feel out our faculty and students about their interest.
Our sponsors haven’t made decisions on where we go next. I’m pretty sure we’re not in a position yet to invest more staff time, but I think many of our original ambitions for the Civil War pilot could be achieved if we continue at a slow pace, one step at a time. There’s also some interest in at least demoing interlinking our Primo discovery layer with DBPedia. We want to continue learning, broaden the participation of staff at the library, and coordinate with more people in our Systems division. We know of at least one faculty digital scholarship project that our programmers are involved in that uses linked data, and we'd like to open our information-sharing group to others at the University.
• Management (learning and experience)
• Technical and learning aspects: different publishing methods
• Technology readiness
• Ecosystem readiness
• Users' perspective on what they get from the library and how they might use the data
• Who should learn, who is in the conversation
(enable linking and creation of linked data)
• Given that the “infrastructure” of global LD isn’t “mature” yet, why not wait for the big players to sort it out? What can we do now? (Our project was an attempt to answer.)
• Our library is “pinched” for staff time – what can we do?
• Who in your organization do you get involved in learning/transition, and when? (Our project started from systems and “tech services”, but public services folks came in and we discovered we need them. Everybody!)
• Is LD only “big data” – or is small data a part?
• How can we get our data (metadata) into RDF when we don’t have it?
• Standardization? Who decides?
• Tools for everyone! Who will build them? Where is the community? We need X.....
• Big jobs – e.g. linking LCSH to Wikipedia
• Share info on tools (DLF Zot group – no traction – what would work?)