Thanks for having me. I don’t often get to speak outside of California and I feel fortunate that you were having this Forum when I happened to be local. I think Linked Data is a topic that’s difficult to wrap your head around. It’s straightforward to explain the concepts involved, especially if you have a basic understanding of how the web operates. Where it gets complicated is in its application. Yeah, this linked data stuff has been floating around the web-o-sphere for almost 10 years. Where do librarians and libraries fit in? Should you care? I definitely think so, or I wouldn’t be here. Linked Data isn’t as scary as it sounds. We’ve got a massive amount of metadata poised to take the Linked Data world by storm. We’ve got a lot of work to do if we’re going to take advantage of the opportunities Linked Data affords us. I hope that you’ll walk away from my talk today with a solid grasp of what LD is, the benefits of jumping in, and a desire to test the waters and develop a pilot project for your institution. If I can do it, you certainly can do it too.
Michelle tells me that we’ve got to adhere to some guidelines so the public librarians in the room can use this session for continuing education credits. These are your observable and measurable learning goals so we can be absolutely clear. By the end of this talk you should be able to do these bullet points. I’ll summarize these points before we wrap it up. We want to be as explicit as possible so you can get your credits. All the continuing education credit answers in these slides will be highlighted in red text. Plus these slides will be made available to you. I don’t want you to have to spend your energy writing it all down for yourselves. Instead I’d appreciate your attention and your questions as we run through this. Personally, I haven’t found it easy to comprehend the implications of Linked Data and I’d rather help you avoid this pain. I’d like to address your questions as they occur to you rather than waiting until the end.
Before we get into this any further, I’d like to ensure we understand the terms I’ll be tossing around and how they’re going to be abbreviated in the slides. Most of these I’ve taken from linkeddata.org, a simple web site authored by Tom Heath from Talis as a service to the worldwide LD community. It’s a good launching point for learning more and I encourage you to explore it.
Linked Data is simply a set of best practices for publishing and connecting structured data on the Web.
It’s important to differentiate between Linked Data and Linked Open Data. It’s better if you can make Linked Open Data because the whole idea behind Linked Data is that you want other people to use it. Open licenses mean your data is probably going to get more use. The types of licensing for open data are an ongoing point of discussion. Records like those in WorldCat have complex ownership, and their provenance can’t be determined very well. The JISC Open Bibliographic Data Guide suggests CC-BY for copyrightable material, and ODC-PDDL or CC0 for factual data. Avoid prohibiting commercial re-use, as the relationship between commercial and non-commercial use is complex (see LibraryThing).
A side note: now that you know the acronym LOD, you need to know the acronym LOD-LAM. This isn’t an acronym that plays heavily in this presentation but it’s one that will be helpful for you in exploring LD after you leave today. The 1st International LOD-LAM summit happened this past June with lots of really knowledgeable people. They continue to communicate and spread LD love. There is an excellent Twitter hashtag, #lodlam, for staying current and following the work that’s coming.
You already know URLs. URLs are a form of URI. Probably the most common form. URIs can be expressed with other syntax/protocols, but this is pretty much what everybody uses.
Is anybody totally unfamiliar with RDF? Resource Description Framework is a data model.
It describes relationships based on triples. Triples are also known as RDF statements. A statement is just like a statement in English: a sentence. Subject-Predicate-Object. The subject and predicate must be URIs. The object can be a URI or a literal string. Be careful not to confuse RDF, which is a data model, with the syntaxes in which RDF can be written: RDF/XML, RDFa (for HTML documents), Turtle (Terse RDF Triple Language), or RDF/JSON.
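If it helps to see the data model separately from any one syntax, here is a minimal Python sketch. The `URI`, `Literal` and `Triple` classes and the `to_ntriples` function are names made up for this illustration, not part of any library. The output syntax is N-Triples, a very plain line-per-statement format that isn't one of the syntaxes named above; it was chosen only because it is easy to read. The URIs are the ones used in the "Women in Science" example in the slides.

```python
from typing import NamedTuple, Union


class URI(str):
    """A node or relationship identified by a URI (hypothetical helper class)."""


class Literal(str):
    """A plain string value; only allowed in the object position."""


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]


def to_ntriples(triple: Triple) -> str:
    """Write one triple as a single N-Triples line."""
    if not isinstance(triple.subject, URI) or not isinstance(triple.predicate, URI):
        raise TypeError("subject and predicate must be URIs")

    def term(node):
        if isinstance(node, URI):
            return f"<{node}>"
        # Escape backslashes, double quotes and newlines inside the literal.
        escaped = (
            node.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        return f'"{escaped}"'

    return f"{term(triple.subject)} {term(triple.predicate)} {term(triple.obj)} ."


WORK = URI("http://www.sdsc.edu/ScienceWomen")

statements = [
    # Object is a literal string.
    Triple(WORK, URI("http://purl.org/dc/terms/title"), Literal("Women in Science")),
    # Object is another URI, so a machine can follow it.
    Triple(WORK, URI("http://purl.org/dc/terms/creator"),
           URI("http://viaf.org/viaf/171972263")),
]

lines = [to_ntriples(t) for t in statements]
for line in lines:
    print(line)
```

The same two statements could just as well be written out as RDF/XML or Turtle; the triples themselves would not change, only the way they are spelled on the page.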
Pronounced Sparkle. The way that computers run queries to pull information from Linked Data and combine it with other Linked Data. Now that we’ve got the basic terminology down we can explore it in a bit more detail.
We’ve seen the brief definition of LD. What is it telling us? It’s telling us about something we already know: structured data. Librarians have been structuring data with MARC for over 40 years. The biggest thing you need to remember about LD is that it’s meant to be machine-actionable. We’re not creating LD directly for end-users. We’re creating LD so clever technical people can suck it into their web applications and create interesting and useful information interfaces for end-users. SPARQL is the way that they do that automagically. The other key thing about Linked Data is the follow-your-nose phenomenon. On the web of documents one can surf link after link. On the web of data the same thing applies. Machines can continue following the links. The deeper mining of information will provide more context to whatever data an application chooses to display. Since meaning and relationships are embedded in the data one can ask more complex questions of the information. Ex. counties in Maryland with more than 10,000 inhabitants, even if no comprehensive list exists. The machine can track the scent, find the related facts, and mash them up together. It’s like APIs. We’re familiar with the concept of mash-ups. Unlike mash-ups where you’re piping together info from a bunch of different APIs, you’re pulling together stuff from the full web itself.
We can think of Linked Data as the web on steroids. The web as we know it is a structure of linked documents (or document-like objects). Perhaps some diagrams might help you visualize it.
The web of documents that we know links resource to resource to resource.
On the web of data, you’re connecting discrete facts to other facts. What you’ve got here is some nodes and some edges (lines) or arcs. The edges represent relationships between two nodes of data. The relationships are what’s important. These “graphs” are what get represented in RDF. A single node/edge/node is called a “triple” in RDF parlance. Computers combine the graphs to create new sets of information. It’s done dynamically using SPARQL, the RDF query language.
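To make "computers combine the graphs" a little more concrete, here is a rough Python sketch. It is not SPARQL and not a real triple store: the `match` function is a made-up stand-in for the kind of pattern matching a query language does, and every example.org URI below is invented for illustration.

```python
# Made-up identifiers for the illustration.
BOOK = "http://example.org/book/1"
PERSON = "http://example.org/person/a"
CREATOR = "http://purl.org/dc/terms/creator"
TITLE = "http://purl.org/dc/terms/title"
WORKS_AT = "http://example.org/vocab/worksAt"  # hypothetical predicate

# Two small graphs, as if published separately by two different sources.
library_graph = {
    (BOOK, CREATOR, PERSON),
    (BOOK, TITLE, "An Example Book"),
}
university_graph = {
    (PERSON, WORKS_AT, "http://example.org/org/u"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in sorted(graph)
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Combining graphs is just pooling the triples, because shared URIs line up.
merged = library_graph | university_graph

# Follow your nose: book -> creator -> where the creator works.
creators = [o for (_, _, o) in match(merged, subject=BOOK, predicate=CREATOR)]
employers = [
    o
    for creator in creators
    for (_, _, o) in match(merged, subject=creator, predicate=WORKS_AT)
]
print(employers)
```

Neither source knew about the other, yet the question "where does the author of this book work?" can be answered once the two sets of triples sit side by side, because both used the same URI for the person.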
There are particular rules for creating these graphs. Each resource/relationship/resource is a sentence, and like a sentence it has a grammatical structure. The difference between triples and colloquial sentences is the use of URIs rather than words. Still, we have structured information here.
We’re definitely used to structured data with a well defined syntax. Let’s take our traditional MARC record. Because we know the code, we know that I am the author of a work called Women in Science. This relationship between me and my work is implicit. Because it’s implicit, machines can’t understand it. Machines are dumb and they don’t do inference very well.
When we do Linked Data we make the implicit explicit. We write the triples. Each triple is a sentence. The relationships are made very clear with the predicate URIs.
This is the sentence content expressed as URIs. Relationship URIs are defined via controlled vocabularies/ontologies. In this case, Dublin Core. DC is familiar ground. This is something librarians know a thing or two about!
When we put it all together this is what we’ve got. The URI http://viaf.org/viaf/171972263/ can unambiguously identify me. It is a “hook” by which applications can go out and get more information on me. And we’ve got our Dublin Core vocabulary pointers and RDF syntax and schema pointers so the computers know what we’re talking about in terms of how all the stuff in the RDF description works. Up until now, this new way of dealing with metadata looks vaguely familiar. What we’ve got here, looks a lot like a record. BUT!!!!
There’s always a big but. We have to get over the mindset that we are working with records when we’re dealing with Linked Data. Each bit of data in our traditional MARC record has the potential to be its own little useful ingredient in somebody else’s application. We’re going to be carving up our bibliographic record into chunks. Even if we find it useful to group our chunks into a full record stew, nobody but librarians will care about “records”. And, it’s entirely possible that when we create the stewy records that only we care about, some of the chunks are going to be Linked Data which we’ve “borrowed” from some external vocabulary of URIs. I think that this is really difficult for traditional catalogers, even though we’re used to link-outs from bibliographic to authority data. Diane Hillmann is my metadata hero. She’s been patiently explaining this to catalogers for most of this decade. And I’ve seen her get a lot of push back in venues like CC:DA and NextGen cataloging lists. Diane has been walking the talk. Her RDA vocabularies work is precisely so that you can grab the external vocabulary and put it in the relevant chunk of your ILS record.
Now we know what Linked Data is and what it looks like. Sort of. We understand there are graphical nodes of factoids written as URIs and combined into triples. How does this translate into something you would personally use on the web? This is difficult. The Linked Data web is going to look pretty much like the web as we already know it. It’s also difficult because there aren’t a lot of killer applications out there where end users can see Linked Data in action, especially when it’s used for dynamic sites where you can do the type of focused questioning of the data that Berners-Lee uses when he describes the potential of Linked Data. SPARQL end points are NOT user friendly even if they do have a public interface for people vs. a page web apps can point to. And, setting up the SPARQL end point requires some technical chops. There are some live examples of what LD would look like to Joe Schmoe surfing the web. I’ve already mentioned the VIAF as an example of Linked Data in action because it makes URIs available for each unique authority record. The VIAF web site itself is a Linked Data application which dynamically generates link-outs to Wikipedia pages and WorldCat Identities.
The Open Library is an example. See an author page: http://openlibrary.org/authors/OL18236A/Alexandre_Dumas_%28p%C3%A8re%29 The author page is not hand coded but created on-the-fly. We’re somewhat accustomed to on-the-fly pages when our web sites are database driven. The difference here is that the entire web is now our database. The resulting page here is relatively static. You’re not structuring queries like show me all books by Alexandre Dumas which have colored illustrations and are translations. One of the big dreams of the Semantic Web/Linked Data is that we’ll be able to ask these complicated types of questions and the computers will find the disparate info and compile it for you in one spot. The data needn’t live together. This static page doesn’t do those sophisticated queries and there aren’t many web sites which have easy ways to do them. You’d need to know how to write a SPARQL query and know which data sources to point at.
Thinkbase is another example of an end user Linked Data web page. In this case it’s a bit more dynamic since you can visually browse the relationships between entities and you’re following your nose. If you want to see more examples there are lots of them available on http://thedatahub.org/ or you can see some library specific ones listed in the final report of the W3C Library Linked Data Incubator Group. All of this seems like a lot of work. And yeah, it looks really cool and it’s fun to surf around. But why would librarians ever possibly want to dive into doing metadata the Linked Data way? What’s in it for us?
Sharable/Reusable – means that libraries don’t have to create their records in standalone isolation. Different kinds of data about the same thing can be produced in a decentralized way by different actors, then aggregated into a single graph. Everybody can use their own metadata format since all the triples are pulled together. There’s no harvesting; stuff is available out there on the web (caveat: id.loc.gov cache, Ockerbloom). Extensibility – we won’t be stuck using just the ILS to manage metadata; we can use industry solutions, and thus have a wider choice of vendors. There’s a wider choice of developers, since we’re not stuck with those in the library niche. Vendors aren’t stuck using library standards (MARC, Z39.50) but can use standard web protocols (HTTP, RDF). Limits redundant work: everybody creates only the data they need and retrieves already existing information. Discoverability – more machine friendly, data used in more places. New types of navigation can be built on top, like Thinkbase. Also ensures that data is current. And besides sucking stuff in, pushing stuff out improves the visibility of the Library.
Enhanced publications – citations as easy as citing a URI; automate retrieval of works cited in a paper. Library data fully integrated with research documents & bibliographies. New modes of research which can only be done via linked data. Separates semantic meaning & relationships from the structured encoding of the data (i.e. MARC), which makes maintaining your local systems a bit easier. URI persistence means you can track down the original data whatever its origin. Caveat: Linked data refers to technical interoperability. The data may be behind a pay or permissions wall. What we want to use in the Library world is linked OPEN data: data which can be freely used. The benefits make it really obvious that metadata librarians are going to have to suck this up and learn how to do Linked Data.
Even LC is doing it. From their Bibliographic Framework Initiative General Plan: “The semantic web and related linked data model hold interesting possibilities for libraries and cultural heritage institutions” and “(the bibliographic framework environment should have) Accommodation of textual data, linked data with URIs instead of text, and both”. New ILS systems which will take full advantage of Linked Data, like eXtensible Catalog and Kuali OLE, are being developed with this in mind. And really forward looking bleeding edge libraries are already doing it. For example the Norwegian University of Science & Technology (NTNU) is incorporating LCSH linked data into their cataloging AND they’ve been trained to catalog directly into RDF, albeit using some web forms created by their programmer. I really urge you to look into the work of Rurik Greenall if you’re interested in the details on that.
Knowing what Linked Data is and why it’s good for us is one thing. How do we go about doing it? First, you need to have the pre-requisite skills under your belt. Run, don’t walk, to your nearest preferred type of training. There are some basics you need to understand before you can even begin. You must know how the web works, understand HTML, and understand RDF and how it’s serialized (or written out) in RDF/XML or Terse RDF Triple Language (called “Turtle”), a subset of the Notation3 (N3) language. A full tutorial on how-to is way beyond the scope of this presentation. And honestly, I’m still teaching myself how to do this. As many times as somebody explains RDF to me, I haven’t really had the opportunity to use it in my day-to-day work, so it’s still an abstract concept. We’re starting out very slowly at Caltech with a pilot project in order to wrap our heads around it. What I’m going to do is give you a general overview of the how-to. When you’re ready to move beyond the first couple of steps, you’re probably going to have to work with a programmer or climb a steep learning curve. I’m at that stage where I’m seeking out hands-on sessions with knowledgeable techies. See for example the upcoming CODE4LIB preconference by Dan Chudnov.
This is Sir Tim Berners-Lee’s basic instruction on how to do linked data. Think of the Berners-Lee principles as 4 steps you need to take in order to do Linked Data.
This is another way of visualizing the process. I took this from Tom Heath & Christian Bizer’s book “Linked Data: Evolving the Web into a Global Data Space” which I highly recommend. It’s freely available on the web and I find it pretty readable at my level of technical expertise. Berners-Lee steps 1 & 2 (use URIs as names, make them HTTP URIs) are covered in data preparation. Steps 3 & 4 (providing useful information using the standards, and including links to other URIs) are covered between data storage and data publication.
It will probably make the most sense if I use this to walk you through the process that we’ve been using at Caltech to create Linked Data. We’re a small organization with 412 currently active faculty. We decided that our pilot project would be to expose our faculty names as Linked Open Data. We felt it was a good choice because that number of names was manageable. We also figured that we would be able to use the end product within other applications. I’m sure you’re well aware of the name ambiguity problem within scholarly communication. As a research intensive organization, it will be important for us to have tools for tracking the publications of our faculty. We can foresee using author name Linked Data in our institutional repository. In fact, the recent release of ePrints software includes Linked Data capacity. We want to have our Linked Data ready to go when we get our repository upgraded. We’re currently finishing up the data preparation stage with the Caltech faculty names linked data pilot. We procured a list of all 412 currently active faculty members and ensured that we had authority records for them in the LC/NAF and by extension in the VIAF. LC and VIAF provide URIs for each personal name in their databases. If we get a list of those names/URIs, we’ve actually accomplished Berners-Lee steps 1 & 2, since these URIs are HTTP URIs.
This is where we’re currently at with our project. We have a list of faculty names and we created authority records. We made a spreadsheet and noted the LCCN for each name as we deposited them in the NAF. What we don’t have yet is a list of full URIs associated with each name, even though we know they exist in NAF/VIAF. There are tools for transforming our identifier numbers into URIs ranging from regular expressions to things like Google Refine. We’re in the process of converting our IDs to those URIs. Exposing linked data can be as simple as putting up a .csv file on the internet. We could do that and call ourselves done. Somebody can grab this and do something with it. But they’d have to do a bit of work to do something useful. As it stands, all we have is a list of end points even when we’ve transformed our IDs to URIs. We have objects but not subjects or predicates. We’re missing relationships and links. Build it and they will come is a nice concept but the real power of linked data comes when you can let machines do something with the data you’ve exposed. That means using the RDF data model. Remember, RDF is not a data format; it’s a pattern for describing resources in the form of subject, predicate, object triples. You need to put your data into a format that can be transmitted across the network (called “serialization” in techie parlance). The two most common serialization formats are RDF/XML and RDFa. You may have noted in the illustration of the process that there are “RDF-izers” for taking your Excel data and creating an RDF file. This would get us to Berners-Lee steps 3 & 4. I’ve been told that Google Refine has an RDF plug-in that can do this for us.
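For the regular-expression route, here is a small Python sketch of what the conversion might look like. It is only an illustration, not the script we actually use at Caltech. It assumes that stripping the whitespace from an LCCN and putting http://lccn.loc.gov/ in front is enough, which is what the two sample rows and the Feynman example in the slides suggest; the shape check inside `lccn_to_uri` is an extra guess of mine. To get from a bare list of URIs to actual triples it then attaches each name with `rdfs:label`, a predicate picked purely for the example, and prints the result as N-Triples lines rather than RDF/XML to keep it short.

```python
import csv
import io
import re

# Two sample rows from the slides; a real export would be read from a file.
SPREADSHEET = """Name,LCCN
Robert B. Phillips,n 00014131
Keith C. Schwab,nr2002032640
"""

LCCN_BASE = "http://lccn.loc.gov/"
# Illustrative choice of predicate, not necessarily what a real project would pick.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def lccn_to_uri(lccn: str) -> str:
    """Turn an LCCN as typed in the spreadsheet into an HTTP URI."""
    normalized = re.sub(r"\s+", "", lccn)
    # Assumed shape for name authority LCCNs: a short letter prefix, then digits.
    if not re.fullmatch(r"[a-z]{1,3}\d+", normalized):
        raise ValueError(f"unexpected LCCN shape: {lccn!r}")
    return LCCN_BASE + normalized


def rows_to_ntriples(csv_text: str) -> list:
    """Build one 'URI rdfs:label name' statement per spreadsheet row."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uri = lccn_to_uri(row["LCCN"])
        name = row["Name"].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{uri}> <{RDFS_LABEL}> "{name}" .')
    return lines


ntriples = rows_to_ntriples(SPREADSHEET)
for line in ntriples:
    print(line)
```

Each printed line now has a subject, a predicate and an object, which is the step up from a plain list of identifiers. Linking each person to their publications or to their VIAF URI would follow the same pattern with different predicates.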
Once you’ve got the data ready you’ve got to put it out there. This is a diagram visualizing the types of data linkages that can happen once your Linked Data is in the wild. Exposing the data is like ftp’ing an HTML file to your web server. All you’re doing is putting your RDF file up on your server. That static RDF page, which has full triples of URIs which are linked to other URIs, is now available. In order for machines to find it, you’d need to ensure that you serve RDF files in response to HTTP requests. Of course exposing the data can also be done in a database driven way where you put your RDF into a thing known as a “triple-store” which is query-able by applications, usually using SPARQL. Doing that is way way beyond me at this point. This is where the “talk to your programmer” part comes into play. The machines will know what to do with it once it’s out there.
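One common way of "serving RDF files in response to HTTP requests" is to look at the Accept header the client sends and hand back either the human page or the RDF file. Here is a deliberately simplified Python sketch of that decision. The file names are hypothetical, the function `pick_representation` is made up for the example, and it ignores the quality weights (`q=`) that a real web server would take into account, so treat it as an illustration of the idea rather than a server configuration.

```python
# Hypothetical file names; nothing here describes an actual server setup.
AVAILABLE = {
    "application/rdf+xml": "faculty.rdf",  # for machines
    "text/html": "faculty.html",           # for people
}
DEFAULT = "text/html"


def pick_representation(accept_header: str):
    """Return (content_type, filename) for the first listed type we can serve.

    Simplification: media types are tried in the order the client lists them,
    and any parameters such as q-values are dropped.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in AVAILABLE:
            return media_type, AVAILABLE[media_type]
    # Browsers and unknown clients fall back to the human-readable page.
    return DEFAULT, AVAILABLE[DEFAULT]


# A Linked Data client asking for RDF/XML gets the RDF file...
print(pick_representation("application/rdf+xml"))
# ...while a browser-style request gets the HTML page.
print(pick_representation("text/html,application/xhtml+xml"))
```

In practice this decision is usually made in the web server's configuration rather than in a script, which is one more reason the "talk to your programmer" advice applies.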
I’m going to stop now before your brains burst. We’ve gone through a metric ton of information and I know it’s a lot to absorb. Be kind to yourselves and don’t expect it to stick all at once. If you’re like me, it’s going to be an ongoing process. We’ve hit all of our learning objectives. We know that Linked Data describes a method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies such as HTTP and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried. I discussed many benefits of using Linked Data. My top 3 would be: share-ability (no more cataloging silos), extensibility (not having to use the old ILS tools), and the potential of doing scholarly communication in new ways, adding LD to the mark-up of scholarly text. The process of creating LD, highly simplified, is: getting your data, getting URIs for that data, putting the URIs into RDF triples, and exposing that data on the web. Using LD created by others: I fudged a little here by telling you to talk to your programmers. But really, doing the dynamic query-able LD requires a high level of technical savvy. The way we’re going to start making use of LD created by others is within more traditional cataloging processes. RDA vocabularies exist and URIs can be used in your record creation process; new ILSs will have Linked Data capacity which will let you incorporate external LD links.
Thanks again for having me. This has given me the opportunity to further my own understanding of Linked Data. My goal is that someday Metadata Librarians will understand it and make use of it in awesomely creative ways.
Transcript of "Linked data for Libraries, Archives, Museums"
Linked Data for Libraries, Archives, Museums
Learning objectives
• Define the concept of linked data
• State 3 benefits of creating linked data and making it available
• Outline the process of creating LD
• State how to make use of LD created by others
Linked Data (LD): "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF." http://linkeddata.org/faq
Linked Open Data (LOD): Linked Data that is explicitly published under an open license. Not all Linked Data will be open, and not all Open Data will be linked.
LOD-LAM: Linked Open Data in Libraries Archives Museums #lodlam
URI: Uniform Resource Identifier. A string of characters used to identify a name or resource on the Internet.
RDF: Resource Description Framework. “a metadata data model. It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.” Wikipedia
RDF: Defined statements comprising a subject, a predicate (property), and an object. These statements are called “triples”.
SPARQL: SPARQL Protocol and RDF Query Language. SPARQL Endpoint: “URL for a given set of RDF data that you can send queries to and get answers from” Dorothea Salo
Linked Data (LD): Linked data “describes a method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies such as HTTP and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried” Wikipedia definition
[Diagram: the web of documents. Each resource links to other resources.] Diagram by Emily Nimsakont
[Diagram: the web of data. Individual pieces of data link to other pieces of data.] Diagram by Emily Nimsakont
Relationship grammar: Resource A relatedTo Resource B. Describe resources using interrelated “statements” (RDF triples). Use URIs (unique globally managed identifiers) as the “words” of the statement. Slide by DCMI tutorial “What makes the Linked Data Approach Different”
Traditional metadata = Implicit Relationships
MARC Bibliographic Record
100 10 Smart, Laura J. ǂq (Laura Jean), ǂd 1971-
245 00 Women in Science ǂh [electronic resource].
Linked Data is Explicit
Laura J. Smart isCreatorOf Women in Science
Women in Science isTitleOf sdsc.edu/ScienceWomen
Object – predicate – subject
Triple with URIs
Laura J. Smart: http://viaf.org/viaf/171972263
is creator of: http://purl.org/dc/terms/creator
Women in Science: http://www.sdsc.edu/ScienceWomen
Under the hood
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.sdsc.edu/ScienceWomen">
    <dc:title>Women in Science</dc:title>
    <dc:creator dc:source="http://viaf.org/viaf/171972263/"
                rdfs:Literal="Laura J. Smart" />
  </rdf:Description>
</rdf:RDF>
It’s the data, stupid. “We’re not dealing with records anymore. We are working with interrelated nodes of data” Diane Hillmann
What does it really look like? “This is kind of like asking what electricity looks like: it doesn’t so much look like anything, as it makes certain things possible” Karen Coyle
Benefits of creating/using Linked Data
• Enhanced publications
• Facilitate research
• Separate semantics from syntax
• Persistent URIs an aid to digital preservation
• Drive users to your site
• Collaborate with less licensing hassle (LOD)
All the kids are doing it: “The new bibliographic framework project will be focused on the Web environment, Linked Data principles and mechanisms, and the Resource Description Framework (RDF) as a basic data model.” LC Bibliographic Framework for the Digital Age http://www.loc.gov/marc/transition/news/framework-103111.html
How to? “Learn about Resource Description Framework. Never look back.” Rurik Greenall, Norwegian University of Science & Technology. Other prerequisites: HTML. URIs.
Berners-Lee Basic Linked Data Principles
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more things.
The process: 1st get your data
Feynman, Richard Phillips, 1918-1988
LCCN: n 50002729
http://viaf.org/viaf/44298691
http://lccn.loc.gov/n50002729
The process: Get your data into RDF/XML
From here:
Name                  LCCN
Robert B. Phillips    n 00014131
Keith C. Schwab       nr2002032640
To here:
Robert Phillips: http://lccn.loc.gov/n00014131
Creator: http://purl.org/dc/terms/creator
Book title: http://openlibrary.org/books/OL11358296M (Physical biology of the cell)
Learning objectives
• Define the concept of linked data
• State 3 benefits of creating linked data and making it available
• Outline the process of creating LD
• State how to make use of LD created by others