Building a Linked Open Data Set

Using the conversion of Taxonomic Literature II as an example, I discuss in this high-level presentation some things to keep in mind while creating a linked open data set.

Also I present a few examples and links to LOD data sets and more information.

Published in: Technology, Education

  • Originally this presentation was going to center around a discussion of our conversion of TL2 to linked data and what we learned, but I felt that it would be better to use it as an example of things to keep in mind when creating your own data sets.
  • Situated at the center of the world's largest museum complex, the Smithsonian Libraries forms a vital part of the research, exhibition, and educational enterprise of the Institution. The Libraries unites 20 libraries into one system supported by central collections support services. We maintain publication exchanges with more than 4,000 institutions worldwide that supply Smithsonian scientists and curators with current periodicals, exhibition catalogs, and professional society publications. Through preservation treatments, experts work to save the Smithsonian's 1.5 million printed books and manuscripts for future generations. Our Digital Library creates electronic versions of rare books and other distinctive collections, as well as exhibitions and specialized finding aids. We can be found on the web at http://library.si.edu
  • A brief summary of what this presentation includes.
  • I dislike disclaimers, but we’re still new to linked open data and are learning as we go. The idea of LOD has been around for several years now, so we are also playing a bit of catch-up. Our first goals are to get some data online, then start linking our data out to other sources and encourage others to link to us. We don’t yet know how our data relates to others’. It’s not scientific data created as part of a research project per se, but initially we see it as valuable, useful information, at least for some segments of the research world.
  • Since this presentation doesn’t center around the idea of what linked data is, we’re not going to spend any time on it. But just in case… Question: How many of you are familiar with linked data? Have linked data online? Wish you had linked data? Wish you had a website? This page is a quick summary for those who don’t know what linked data is. RTFM (Read The Friendly Manual!)
  • This is a quick demonstration of how linked data has grown over the past five years. Back in 2007 we had only a handful of data sets, at least according to Richard Cyganiak’s searching. Between 2009 and 2010 the number of data sets doubled. As of Sept 2011 there were 295 data sets listed. There are probably more today, and more are being added every day. It is likely that not all data sets are represented here, so this is only a sample of what’s available. What’s the point? This is all data that has the potential to enhance YOUR data. This is all linked data. This is all open data.
  • So as an example of how to create a data set, I’ll use Taxonomic Literature II. It is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. It contains almost 10,000 authors and about 37,000 publications. The reason to focus on TL-2 is that we aim to be the authority on the web for this information. We have received permission from the IAPT (Intl Assoc for Plant Taxonomy) to digitize and release this information on the web under an open license. TL-2 is used by most (?) botanists, and their work is made easier by this data being online. Prior to 2012 this information was either located in a library or locked behind a paywall of sorts.
  • This is a page of TL-2 showing Charles Darwin and On the Origin of Species, with highlights on the items that are immediately visible and can be parsed and turned into Linked Data. There is other data in the page that could be turned into linked data, but at this time we have only parsed the data that is highlighted on this page. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. Two important things to note here: the first is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. The second is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
  • Briefly, this was our process to create the data. In Jan 2011, we scanned the books and placed them online at the Internet Archive. Later, after selecting a contractor, we sent the scans and the OCR text (created at the Internet Archive) to the contractor, who ultimately created a 99.97% accurate text version of TL-2. They then parsed that data to a limited degree and delivered to us an XML dataset that we then imported to a SQL Server database. Finally, we created a searchable, browseable website to access the TL-2 data, opening it up to researchers around the world. Two of them use it on a regular basis. (rimshot!) In reality, in a month we get about 500 people visiting and 6000 pageviews, with about 60% of those coming from outside of the U.S.
  • This is our current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
  • Earlier we mentioned 99.97% accuracy. If we assume 38 million characters in all of TL-2, this means there are roughly 11,000 to 12,000 errors in our text (0.03% of 38 million is about 11,400). (In reality this is more like 5,000-6,000 due to the nature of our data.) This may not be bad for the textual components of the content, but when it comes to parsing citations or more structured information, this will prove to be a challenge. Other data sets may not have this problem, but as we are scanning and converting to text, this is something that will always be present for us.
  • So how do we create linked data? Basically this is the approach we are using. There’s probably more that needs to be done, but today, this is what we know we need to do. The choice of identifier is important: if possible it should be human friendly, but numbers are also common in places such as OCLC WorldCat. Additionally, the TL-2 number is already unique to each publication, so we will very likely go with that as our primary identifier of publications.
  • Mondeca, an information management company based in Paris, created a directory of linked open vocabularies as part of their “labs” and grouped them together by similar disciplines. From largest to smallest, the groups are General and Meta, Library, City, Web, Space-Time, Science, Market (and finance) and Media. Library is the second largest on this list, which may be a matter of how the visualization is created, but may also be because libraries are playing a big part in the LOD movement. This directory might help you figure out which vocabularies are useful to you.
  • A sample of our TL-2 Identifiers and four triples. Note that “tl2:creator” is not the same as “dc:creator” and indicates that we will likely need to create our own ontology for describing the TL-2 dataset. (dc:creator is a reference from a title to an author. We also need the reverse, author to title.) Also note that we’ve crosslinked our two identifiers and, as an example, linked out to other information on the web. The link to the Internet Archive may not be appropriate as it is not a LOD data set, but there is likely a predicate available to “read more” or “see also” for non-LOD websites that are related to the identifier.
  • A further breakdown of our data into linked data showing the predicates we might use for each. Again, the items in orange are specific to TL2 and may not exist in other LOD data sets. For example, the FOAF vocabulary has date of birth, but can we use only a year in that field? Will that foul up other computers? FOAF also doesn’t include date of death, which we definitely have. What predicate do we use? Do we create our own ontology and publish it? (Probably.) Finally, we haven’t yet begun a formal analysis of which existing ontologies might fit our needs.
  • Storage is a consideration. We’re not using a triplestore per se, but are instead relying on Drupal and ARC2 to handle the magic for us. This may or may not be a good solution for the long term. The next four slides are all text. You’ve been warned.
  • Performance is also a concern. It’s been challenging enough to get 47,000 records imported into Drupal. When we start to talk about an additional 500K items, then we have some serious concerns about how well Drupal will hold up, just on the import side of things. We may need to investigate other methods of getting this data into Drupal, or other systems altogether, but that may create added complexity.
  • Another example of why you should be clear about how much data you are creating and how you will manage it. The 2000 US Census sent the “long form” to a subset of 19 million households. These responses were converted to LOD by Joshua Tauberer and resulted in over a billion triples. I’m going to think very carefully before I start working with a billion of anything.
  • A few notes on software that can be used to open up your existing data as linked data. I have not had the opportunity to use any of these yet, but we may still use them in the future. ARC2 – Provides parsers, content negotiation, RDF storage and a SPARQL endpoint. D2RQ – Allows accessing relational databases as virtual RDF graphs. Triplify – Plugin for Web applications to expose your data as RDF, Linked Data or JSON. Virtuoso – Enterprise-level product for normalizing all of your data sources, including providing that data as RDF.
  • Why should we create linked data? Disambiguation – Are you searching for Venus the planet, Venus the sculpture, Venus the painting or Venus the tennis player? Connecting Relevant Info – Linking your data to other data may reveal things that are related to your data that you were unaware of. Search Visibility – Search engines, via schema.org and Google’s purchase of Freebase, are enhancing search. Things will only get better as we move forward. Enrichment of your Data – Mentioned earlier, you may learn things you didn’t know about your data or provide greater context to your data via LOD. Easier Reuse – This is one of the central tenets of LOD. I, as a human, no longer need to say that your Column B in your spreadsheet corresponds to the first_name field in my database.
  • Example of LOD in action. Google’s knowledge graph knows that Darwin is a person and that Shrewsbury is a place, allowing it to offer different, more specialized results in your search. As LOD becomes available, your data may be used to enhance these results. Google is also able to help disambiguate common terms, such as “Lafayette” (college, various U.S. cities, or Marquis de). http://google.com/
  • Example of LOD in action. Combines data from the Energy Information Administration (EIA) on Data.gov with data from OpenEI.org, the U.S. Census and SmartGrid in a mashup that’s easier to create with LOD. http://en.openei.org/apps/mashathon2010/
  • Example of LOD in action. NYTimes is offering a large dataset as LOD. As an example, they provided a tool to enter a university or college and find those people from their database who attended that institution. From there, we are able to see links to other databases and articles from NYTimes that refer to that person. All linked together. From the site: “As of 13 January 2010, The New York Times has published approximately 10,000 subject headings as linked open data under a CC BY license. We provide both RDF documents and a human-friendly HTML versions. The table below gives a breakdown of the various tag types and mapping strategies on data.nytimes.com.” http://data.nytimes.com/
  • This is an example of the “Raw data” available at NYTimes, presented in a user-readable form. I could also make the argument that the identifier at NYTimes is not as good as it should be. A human-readable version would be better, but we see that one is among the owl:sameAs links.
  • At OCLC WorldCat, they have begun publishing the data about an individual item in Linked Open Data using schema.org. This is an example from Darwin’s Origin of Species. You’ll find the “Linked Data” section at the bottom of the page for the details of any individual book on WorldCat. http://www.worldcat.org/oclc/7619054
  • Finally, a few other places where you can learn more about linked data and see examples of other tools built with and for linked open data. The Library of Congress has made available their subject headings in linked data form to both humans and machines. Schema.org encourages the use of your metadata as a variant of linked data in your webpages. Data.gov is the US Government’s source for open data. Other countries are also making their data open on similar websites. There are many, many more sources, so search the web and see what you can find.
  • Thank you! As for brew pubs, I don’t live in Chicago and this is only my second time here, so I’m open to suggestions. There are a lot of bars in this town (as seen in the map).
  • Building a Linked Open Data Set
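
The 99.97% accuracy figure discussed in the notes above is easier to appreciate as arithmetic. The following is a minimal sketch, assuming the 38 million character total given in the notes; the function name `expected_errors` is hypothetical and used only for illustration.

```python
from decimal import Decimal


def expected_errors(total_chars: int, accuracy_percent: str) -> int:
    """Estimate how many characters are wrong at a given accuracy.

    The accuracy is passed as a string so Decimal can represent it
    exactly and avoid binary floating-point rounding surprises.
    """
    error_rate = 1 - Decimal(accuracy_percent) / 100
    return int(total_chars * error_rate)


# Assumption taken from the notes: about 38 million characters in TL-2.
TL2_CHARACTERS = 38_000_000

errors = expected_errors(TL2_CHARACTERS, "99.97")
print(f"Expected character errors: {errors:,}")
print(f"Roughly one error every {TL2_CHARACTERS // errors:,} characters")
```

Under these assumptions the estimate comes to 11,400 errors, in line with the "~12,000" shown on the slide. The notes suggest the practical figure is lower because of the nature of the data.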

    1. Implementing a Linked Open Data set • Joel Richard, Smithsonian Libraries • richardjm@si.edu • SLA Annual Conference, July
    2. Who are the Smithsonian Libraries? • 20 Libraries in the U.S. and Panama • Supports research of staff and the public • Strong effort to digitize pre-1923 texts • Taxonomic Literature II is one of these texts
    3. Summary of Agenda • Our data set and process • Conversion to Linked Data • Storing Linked Data • Examples and More Info • Summary • … and Best brew pubs in Chicago
    4. Disclaimer • We are still learning.
    5. What is Linked Data? • HTTP URIs identify things to Humans and computers • Identifiers are related to other identifiers (or values) via predicates in a “triple”: Charles Darwin // Creator // On the Origin of Species • See also: http://linkeddata.org/ http://en.wikipedia.org/wiki/Linked_Data http://richard.cyganiak.de/2007/10/lod/
    6. http://richard.cyganiak.de/2007/10/lod/
    7. Taxonomic Literature II • Essential Reference Tool for Botanists • Authors and their Publications from 1753 to 1940 • It is a “database in book form.”
    8. [Image: page of TL-2 showing the Charles Darwin entry]
    9. Our process • Scanned the pages • Hired contractor for OCR and correction (99.97% accuracy) • Received XML dataset from Contractor • Verified and Imported to SQL Server • Built a website to search the data
    10. [Image: TL-2 website search results for Charles Darwin]
    11. Great! Let’s make some linked data! • First... what does 99.97% accuracy mean? • ~12,000 Errors
    12. Great! Let’s make some linked data! • Select Identifiers for your data: http://library.si.edu/tl-2/author/darwin http://library.si.edu/tl-2/title/origin_of_species http://library.si.edu/tl-2/title/1313 • Choose vocabularies for predicates (harder than it sounds): OWL, FOAF, DublinCore, OpenGraph, SIOC, SKOS, BIBO, etc.
    13. Mondeca Labs • Linked Open Vocabularies (LOV) • Vocabulary of a Friend (VOAF): A vocabulary for describing other vocabularies • http://labs.mondeca.com/dataset/lov
    14. http://library.si.edu/tl2/author/darwin • tl2:creator http://library.si.edu/tl2/title/1313 • owl:sameAs http://viaf.org/viaf/27063124 • http://library.si.edu/tl2/title/origin… • dc:creator http://library.si.edu/tl2/author/darwin • owl:sameAs http://www.archive.org/details/originofspecies00darwuoft
    15. http://library.si.edu/tl2/author/darwin • RDF Type = foaf:Person • foaf:lastName, foaf:familyName • foaf:firstName, foaf:givenName • foaf:name, skos:prefLabel • tl2:birthYear • tl2:deathYear • skos:definition • tl2:personAbbreviation • http://library.si.edu/tl2/title/origin… • RDF Type = bibo:Book • tl2:titleNumber • dc:title • event:place • dc:publisher • tl2:titleAbbreviation • dc:created
    16. Great! Let’s make some linked data! • How are we going to store all this? • We’re using Drupal. RDFa is built-in, RDF extensions is an add-on module. • Probably not a good idea for very large datasets. • TL-2: 10,000 authors + 37,000 titles becomes about 400,000 triples.
    17. Storage considerations • Performance of Drupal Import: Feeds Import: 7 Hours for 35k Records • Other options? Still searching… • Our linked data set will grow to at least 600-700k Drupal nodes. • Is Drupal the best way to do this?
    18. Storage considerations • 2000 US Census • 19 million households received “long form” • Joshua Tauberer: converted to 1bln triples • http://www.rdfabout.com/demo/census/ • Carefully consider your storage options!
    19. Storage • ARC2 used by Drupal 7 • RDBMS via D2RQ • RDBMS via Triplify • OpenLink Virtuoso • See Also: http://www.w3.org/2001/sw/rdb2rdf/use-cases/
    20. Linked Data. What’s the point? • Disambiguation • Connecting Relevant Information • More visible via search • Enrichment of your data • Easier reuse of data
    21. [Image: Google search results using the Knowledge Graph]
    22. http://en.openei.org/apps/mashathon2010/
    23. http://data.nytimes.com/schools/schools.html
    24. http://data.nytimes.com/N38444093941437235523
    25. http://www.worldcat.org/oclc/7619054
    26. Other Examples and Info • Library of Congress: Linked Data Services http://id.loc.gov/ • Schema.org http://www.schema.org • Data.gov / Semantic http://www.data.gov/semantic • Linked Data.org http://linkeddata.org/ • Stephen Dale: Linked Data in Action http://www.slideshare.net/stephendale/linked-data-in-action-4487244
    27. Thank you! • richardjm@si.edu • http://slideshare.net/joelrichard • ?
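
The triples on slide 14 can be written out in a machine-readable form. The sketch below is one plausible reading of that slide, not the Smithsonian's published data, and rests on several assumptions: the `tl2` namespace URI is hypothetical (the slides never give one), the numeric identifier `title/1313` stands in for the title URI that is truncated on the slide, and the Internet Archive link uses `rdfs:seeAlso` rather than the slide's `owl:sameAs`, following the notes' suggestion that a "see also" predicate fits non-LOD websites better. It uses only the Python standard library and emits N-Triples.

```python
# Namespace prefixes. The owl, rdfs and dc URIs are the standard ones;
# the tl2 URI is a hypothetical placeholder for a future TL-2 ontology.
PREFIXES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "tl2": "http://library.si.edu/tl2/ontology#",  # assumption
}

AUTHOR = "http://library.si.edu/tl2/author/darwin"
TITLE = "http://library.si.edu/tl2/title/1313"  # stand-in for the elided title URI

# (subject, predicate, object) tuples, one per triple.
TRIPLES = [
    (AUTHOR, "tl2:creator", TITLE),   # author to title (TL-2 specific)
    (TITLE, "dc:creator", AUTHOR),    # title to author (Dublin Core)
    (AUTHOR, "owl:sameAs", "http://viaf.org/viaf/27063124"),
    # Swapped in for the slide's owl:sameAs, per the speaker notes.
    (TITLE, "rdfs:seeAlso",
     "http://www.archive.org/details/originofspecies00darwuoft"),
]


def expand(term: str) -> str:
    """Turn a prefixed name such as 'dc:creator' into a full URI."""
    if "://" in term:  # already a full URI
        return term
    prefix, _, local_name = term.partition(":")
    return PREFIXES[prefix] + local_name


def to_ntriples(triples) -> str:
    """Serialize URI-only triples, one N-Triples statement per line."""
    return "\n".join(
        f"<{expand(s)}> <{expand(p)}> <{expand(o)}> ." for s, p, o in triples
    )


print(to_ntriples(TRIPLES))
```

Keeping the triples as plain tuples makes the author-to-title and title-to-author directions easy to compare side by side. A real data set would more likely be produced by an RDF library or by the Drupal and ARC2 setup described in the notes.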
