Linked Open Data and Systematic Taxonomy

A short talk in which I briefly discuss the Smithsonian Libraries' plans for Linked Open Data related to our Taxonomic Literature II and Index Animalium digitization projects.

Usage Rights

CC Attribution-NonCommercial-ShareAlike License

  • Originally this presentation was going to center around a discussion of our conversion of TL2 to linked data and what we learned, but I felt that it would be better to use it as an example of things to keep in mind when creating your own data sets.
  • Situated at the center of the world's largest museum complex, the Smithsonian Libraries forms a vital part of the research, exhibition, and educational enterprise of the Institution. The Libraries unites 20 libraries into one system supported by central collections support services. We maintain publication exchanges with more than 4,000 institutions worldwide that supply Smithsonian scientists and curators with current periodicals, exhibition catalogs, and professional society publications. Through preservation treatments, experts work to save the Smithsonian's 1.5 million printed books and manuscripts for future generations. Our Digital Library creates electronic versions of rare books and other distinctive collections, as well as exhibitions and specialized finding aids. We can be found on the web at http://library.si.edu
  • I dislike disclaimers, but we’re still new to linked open data and are learning as we go. The idea of LOD has been around for several years now, so we are also playing a bit of catch-up. Our first goals are to get some data online, then start linking our data out to other sources and encourage others to link to us. We don’t yet know how our data relates to others. It’s not scientific data created as part of a research project per se, but initially we see it as valuable, useful information, at least for some segments of the research world.
  • So as an example of how to create a data set, I’ll use Taxonomic Literature II (TL-2). It is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. It contains almost 10,000 authors and about 37,000 publications. The reason to focus on TL-2 is that we aim to be the authority on the web for this information. We have received permission from the IAPT (International Association for Plant Taxonomy) to digitize and release this information on the web under an open license. TL-2 is used by most(?) botanists, and their work is made easier by this data being online. Prior to 2012 this information was either located in a library or locked behind a paywall of sorts.
  • This is a page of TL-2 showing Charles Darwin and On the Origin of Species, with highlights on the immediately visible items that can be parsed and turned into Linked Data. There is other data on the page that could be turned into linked data, but at this time we have only parsed the data that is highlighted. Clearly, moving from a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. One important thing to note here is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. Another important item is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
  • This is our current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
  • Index Animalium, published in the late 1800s and early 1900s, contains 430,000 species names from 7,000 scientific volumes published between 1758 and 1850. Charles Davies Sherborn dedicated much of his life to this work. The volumes consist of the index to species, with one species plus citation per line, and a bibliography listing the titles that Sherborn read. Challenges in the data include inconsistent citation formats, two kinds of abbreviations (in both the index and the bibliography), and errors introduced during the printing process.
  • This is one example of a page from Index Animalium for Papilio (Danaus) plexippus, AKA the Monarch Butterfly. The abbreviations:
    Linnaeus: Carl Linnaeus
    Syst. Nat.: Systema Naturae
    Ed 10: 10th edition
    1758: Publication Year
    471: Page 471
    Also 12th Edition, published in 1767, page 767.
  • Identified here are the data elements that are “easy” to identify and can be brought to linked data. We still need to contend with the challenges of parsing these into actual citations. The TL-2 data at the top has already been parsed and loaded into a database. Index Animalium is posing a greater challenge and will take longer to complete.
  • A further breakdown of our TL-2 data into linked data, showing the predicates we might use for each element. Again, the items in orange are specific to TL-2 and may not exist in other LOD data sets. For example, the FOAF vocabulary has date of birth, but can we use only a year in that field? Will that foul up other computers? FOAF also doesn’t include date of death, which we definitely have. What predicate do we use? Do we create our own ontology and publish it? (Probably.) Finally, we haven’t yet begun a formal analysis of which existing ontologies might fit our needs.
  • 80/20 Rule: You spend 20% of your time on 80% of the work and 80% of your time on the remaining 20% of the work. We are at that point with Index Animalium. We would like to do further parsing of the TL-2 data, but it will pose challenges similar to those of Index Animalium.
  • Some potential sources of data that we can link to. We’d like to one day have some of these link back to us, thereby completing the circuit for a linked data web of knowledge.
  • This is what we would like to do: A researcher enters a botanist name or a species name and is taken directly to the page in the book referenced by that entry. If the book is not known to be digitized and online, then we can redirect them to OCLC WorldCat to find a copy of that book in their local library. This is a great improvement for those who wouldn’t normally have access to these books in their local library.
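
The abbreviation breakdown in the Index Animalium note above (Linnaeus, Syst. Nat., ed. 10, 1758, 471) can be sketched as a small parser. This is only an illustration, not the Libraries' actual parsing code: the input string is a simplified, hypothetical rendering of an entry, and `ABBREVIATIONS`, `CITATION`, and `parse_citation` are names invented for this example. Real entries have inconsistent formats and OCR errors, so a pattern like this would handle only the easy cases.

```python
import re
from typing import Optional

# Hypothetical lookup table; a real one would come from the bibliography.
ABBREVIATIONS = {"Syst. Nat.": "Systema Naturae"}

# Assumed simplified layout: author, abbreviated title, edition, year, page.
CITATION = re.compile(
    r"^(?P<author>[^,]+),\s*"
    r"(?P<title>[^,]+),\s*"
    r"ed\.\s*(?P<edition>\d+),\s*"
    r"(?P<year>\d{4}),\s*"
    r"(?P<page>\d+)$"
)


def parse_citation(text: str) -> Optional[dict]:
    """Split a simplified citation into fields, or return None if it does not fit."""
    match = CITATION.match(text.strip())
    if match is None:
        # Anything that does not fit the pattern is left for manual review.
        return None
    fields = match.groupdict()
    # Expand the title abbreviation when it is known, otherwise keep it as-is.
    fields["title"] = ABBREVIATIONS.get(fields["title"], fields["title"])
    for key in ("edition", "year", "page"):
        fields[key] = int(fields[key])
    return fields
```

Entries that return `None` are the ones that would fall into the manual part of the 80/20 split described in the notes.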

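The lookup described in the last note above (go to the digitized page if one is known, otherwise send the researcher to WorldCat) might look something like the following. It is a sketch under stated assumptions: the `digitized` mapping and the `resolve` function are hypothetical, and the WorldCat search URL pattern is assumed rather than taken from the talk.

```python
from typing import Dict
from urllib.parse import quote_plus

# Assumed WorldCat search URL pattern, used only as a fallback.
WORLDCAT_SEARCH = "https://www.worldcat.org/search?q="


def resolve(title: str, digitized: Dict[str, str]) -> str:
    """Return the digitized page URL if one is known, else a WorldCat search URL."""
    url = digitized.get(title)
    if url is not None:
        return url
    # Not known to be digitized: fall back to a library catalog search.
    return WORLDCAT_SEARCH + quote_plus(title)


# Hypothetical data: the example.org URL is a placeholder, not a real scan.
known = {"Systema Naturae": "https://example.org/scans/systema-naturae/page/471"}
print(resolve("Systema Naturae", known))
print(resolve("On the Origin of Species", known))
```
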
Linked Open Data and Systematic Taxonomy: Presentation Transcript

  • Linked Open Data and Systematic Taxonomy
    Joel Richard
    Smithsonian Libraries
    richardjm@si.edu
    A tale of two publications
    In three acts
  • Who are the Smithsonian Libraries?
    • 20 Libraries in the U.S. and Panama
    • Supports research of staff and the public
    • Strong effort to digitize pre-1923 texts
    • Index Animalium and Taxonomic Literature II are two examples
  • Disclaimer
    We are still learning.
    We are still building.
  • Act I: The Players (or, identifying the data with which we are working and their meaning and usefulness to the scientific community.)
  • Taxonomic Literature II
    Essential Reference Tool for Botanists
    Botanists/Authors and Publications from 1753–1940
    Multiple indexes, “unique identifiers”
    It is a “database in book form”
  • Index Animalium
    Genus name, author & citation for 430,000 animals
    Covers Publications from 1758–1850
    Also a database, but many challenges still exist in the data.
  • Act II: The Linking (or, identifying those data elements to be linked, inherent challenges of parsing OCR text, and identifying linkable remote data sources)
  • Linkable Data Elements
  • foaf:lastName, foaf:familyName
    foaf:firstName, foaf:givenName
    foaf:name, skos:prefLabel
    bio:birth
    bio:death
    skos:definition
    tl2:personAbbreviation
    tl2:titleNumber
    dc:title
    event:place
    dc:publisher
    dc:created
    tl2:titleAbbreviation
    http://library.si.edu/tl2/author/darwin
    RDF Type = foaf:Person
    http://library.si.edu/tl2/title/origin…
    RDF Type = bibo:Book
  • Challenges with Our Data
    • Errors in the Corrected OCR
    • Challenges in Parsing Citations
    • The 80/20 rule: manually making connections unable to be made by automated means
    • Finding suitable sources of data to link to. (DBPedia? VIAF? EOL? Others?)
  • Linked Data Sources
    Low-Hanging Fruit:
    • DBPedia
    • OCLC WorldCat
    • Biodiversity Heritage Library
    • Virtual International Authority File
    • Encyclopedia of Life
    • Library of Congress Subject Headings
    • GeoNames
    • Open Library
  • Act III: The Sum of the Parts (or, our goals and desires for this data, what it means to the linked data world and the scientific community in general)
  • What’s the point?
    • This data may already exist online.
    • It may also not always be as accurate as needed for science.
    • We are in a position to be the authoritative source for this information.
    • Linked Data allows it to be easily reused and shared.
  • Danaus plexippus
    Index Animalium
    Systema Naturae, etc
    Aimée Antoinette Camus (botanist)
    Your Local Library
  • One Example of Reuse
    Ryan Schenk
    http://synynyms.com/
  • Thank you!
    Joel Richard
    RichardJM@si.edu
    http://library.si.edu/staff/joel-richard
    http://slideshare.net/joelrichard
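
To make the predicate slide above more concrete, here is a rough sketch of how the Darwin author record could be written out as N-Triples. Several things here are assumptions, not the Libraries' published model: the `tl2` namespace URI is a placeholder (the talk does not give one), `tl2:birthYear` is a hypothetical predicate standing in for the slide's `bio:birth`, and typing the year as `xsd:gYear` is just one possible answer to the "can we use only a year?" question raised in the notes. The birth year 1809 is supplied for illustration and does not come from the slides.

```python
from typing import List, Optional

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"
# Placeholder namespace for the TL-2-specific terms (hypothetical).
TL2 = "http://example.org/tl2/terms#"


def uri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str, datatype: Optional[str] = None) -> str:
    """Quote a literal, escaping backslashes and quotes, with an optional datatype."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    suffix = f"^^<{datatype}>" if datatype else ""
    return f'"{escaped}"{suffix}'


def author_triples(subject: str, name: str, abbreviation: str, birth_year: str) -> List[str]:
    """Build one N-Triples line per statement about an author."""
    s = uri(subject)
    return [
        f"{s} {uri(RDF_TYPE)} {uri(FOAF + 'Person')} .",
        f"{s} {uri(FOAF + 'name')} {literal(name)} .",
        f"{s} {uri(TL2 + 'personAbbreviation')} {literal(abbreviation)} .",
        # Year-only value typed as xsd:gYear instead of a full date.
        f"{s} {uri(TL2 + 'birthYear')} {literal(birth_year, XSD_GYEAR)} .",
    ]


for line in author_triples(
    "http://library.si.edu/tl2/author/darwin", "Charles Darwin", "Darwin", "1809"
):
    print(line)
```

Writing the lines by hand keeps the example dependency-free; a real pipeline would more likely use an RDF library to handle serialization.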