Unlocking Taxonomic Literature II using Linked Open Data

The Smithsonian Libraries has digitized Taxonomic Literature II, an essential research tool for botanists. This presentation, with audio, starts with a description of Linked Data and a history of TL-2, then covers some of the methods and challenges we are encountering as we convert it to a digital version and Linked Open Data.

Usage Rights

CC Attribution License

  • This is a quick demonstration of how linked data has grown over the past five years. Back in 2007 we had only a handful of data sets, at least according to Richard Cyganiak’s searching. Between 2009 and 2010 the number of items doubled. As of Sept 2011 there were 295 data sets listed. There are more today, and more are being added every day. Not all data sets are represented here, so this is only a sample of what’s available. The actual graph could be four or five times larger by now. What’s the point? This is all data that has the potential to enhance YOUR data. This is all linked data. This is all open data.
  • The basic unit of LOD is the “triple”, made up of three elements: an identifier, a predicate, and another identifier or a value of some kind. Think of it as a sentence: subject-verb-object. The underlined blue text indicates an identifier that can be linked to on the web. The first part of the triple is always an identifier. The third part is sometimes a plain value, but it should be an identifier if one exists. When we repeat these connections, we start to create a web of networked data.
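The subject-predicate-object idea can be sketched in a few lines of Python. This is only an illustration: the labels come from the Darwin example on the slides, but the way they are paired into triples is read off the flattened diagram and is an assumption, and the helper names `index_by_subject` and `follow` are hypothetical.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object). The pairing of labels is an
# assumption reconstructed from the slide's Darwin diagram.
TRIPLES = [
    ("Charles Darwin", "Born On", "Feb 12, 1809"),
    ("Charles Darwin", "Born In", "Shrewsbury"),
    ("Charles Darwin", "Type", "Person"),
    ("Charles Darwin", "Author Of", "On the Origin of Species"),
    ("Shrewsbury", "Type", "City"),
    ("Shrewsbury", "Is In", "England"),
    ("England", "Type", "Country"),
]


def index_by_subject(triples):
    """Group (predicate, object) pairs under their subject."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def follow(index, start, *predicates):
    """Walk a chain of predicates from a start node; return the end node or None."""
    node = start
    for predicate in predicates:
        matches = [o for p, o in index.get(node, []) if p == predicate]
        if not matches:
            return None
        node = matches[0]
    return node


INDEX = index_by_subject(TRIPLES)
# Because "Shrewsbury" is itself a subject, we can hop from Darwin to a country.
COUNTRY = follow(INDEX, "Charles Darwin", "Born In", "Is In")
```

The second hop only works because the object of the first triple is an identifier that is also the subject of other triples, which is why the notes recommend an identifier over a plain value whenever one exists.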
  • Looking back, we can see that Tim Berners-Lee mapped out the four principles that form the foundation of linked data; they also give it structure and make it easy to use.
  • Going back to our web of data, we can now represent the identifiers as actual web identifiers (URIs). The next question is: where do we get the predicates from? Why are they important?
  • There are numerous vocabularies of predicates that we can use when developing our linked open data. (Describe them more in detail, leading into the next slide)
  • Wow, look at all of them! Mondeca Labs has collected and classified all the vocabularies they can find. There are 350 vocabularies listed here.
  • Here is an example of some linked data in a reasonably human-readable form. We have some prefix definitions of the predicate vocabularies we are using. Then we have the identifier in green, and the predicates in blue. Values are in black with identifiers enclosed in greater-than and less-than signs.
  • What are the benefits of LOD?
  • Example of LOD in action. Google’s knowledge graph knows that Darwin is a person and that Shrewsbury is a place, allowing it to offer different, more specialized results in your search. As LOD becomes available your data may be used to enhance these results. Google is also able to help disambiguate common terms, such as “Lafayette” (college, various U.S. cities, or Marquis de). http://google.com/
  • Here are some more examples of places you can go for linked data. The Library of Congress has a linked data service for its authorities and vocabularies. Schema.org is being used within webpages to improve their visibility and search results. The US Government is offering a lot of data, some of it as linked data. LinkedData.org is a place to go to learn about all things linked data. Finally, Stephen Dale, a knowledge management consultant, has a great presentation with examples of linked data in use, where you can learn more.
  • Overall, TL-2 provides the most comprehensive biographical and bibliographical analysis for systematic botany literature published between 1753 and 1940 to date.
  • Here is a page from TL-2. It’s hard to read. Let’s zoom in a bit.
  • When we’ve zoomed in, we can see Darwin’s name, description, birth and death dates, and an abbreviation in parentheses. We also have herbaria (libraries of plant samples) that he contributed to, and a brief note about his significance and how his works are greater than that which can be contained by TL-2.
  • Continuing our zooming… This includes some additional information that we know about Charles Darwin, including places where we can find known samples of his handwriting, species that were named for him and even postage stamps that honor him.
  • Continuing our zooming… Here we see three publications by Darwin giving a number of the book, the title and publication information.
  • The things that make TL-2 important are the unique abbreviations of the author names, e.g. “Darwin”, outlined in green. Also significant are the abbreviations of the titles of the publications, also outlined in green (“Origin sp.”), but not all publications have titles. In red are the book numbers, also unique across all 37,000 publications. Finally, we have the “short title” of the volumes, which is outlined in blue.
  • Briefly, this was our process to create the data. In Jan 2011, we scanned the books and placed them online at the Internet Archive. Later, after selecting a contractor, we sent the scans and the OCR text (created at the Internet Archive) to the contractor, who ultimately created a 99.97% accurate text version of TL-2. They then parsed that data to a limited degree and delivered to us an XML dataset that we then imported to a SQL Server database. Finally, we created a searchable, browseable website to access the TL-2 data, opening it up to researchers around the world. Two of them use it on a regular basis. (rimshot!) In reality, in a month we get about 500 people visiting and 6000 pageviews, with about 60% of those coming from outside of the U.S.
  • This is the current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. This page got approximately 860 visitors and 1500 visits in the month of April 2013, which is twice the number of visitors we got in April 2012. We actually get more visits from Europe than from North America. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
  • Earlier we mentioned 99.97% accuracy. If we assume 38 million characters in all of TL-2, that means roughly 11,400 errors in our text, or on the order of 12,000. (In reality this is more like 5,000-6,000 due to the nature of our data.) This may not be bad for the textual components of the content, but when it comes to parsing citations or more structured information, this will prove to be a challenge. Other data sets may not have this problem, but because we are scanning and converting to text, this is something that will always be present for us.
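The arithmetic behind the error estimate can be checked with a short sketch. The 38 million characters and the 99.97% accuracy come from the talk; the second function is an added illustration that assumes character errors are independent and evenly spread, which the notes themselves suggest is not quite true for this data.

```python
def expected_errors(total_chars, errors_per_10k=3):
    """Expected wrong characters; 99.97% accuracy means 3 errors per 10,000."""
    return total_chars * errors_per_10k // 10_000


def error_free_probability(field_length, accuracy=0.9997):
    """Chance that a field of this many characters contains no error at all.

    Assumes independent, uniformly distributed errors (a simplification).
    """
    return accuracy ** field_length


TOTAL_ERRORS = expected_errors(38_000_000)
# A hypothetical 50-character citation fragment, used only for illustration.
CLEAN_CITATION = error_free_probability(50)
```

Under these assumptions a 50-character field still has about a 1.5% chance of containing at least one bad character, which is why structured data such as citations is more exposed than running text.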
  • This is a page of TL-2 showing Charles Darwin and On the Origin of Species, with the immediately visible items that can be parsed and turned into Linked Data. There is other data in the page that could be turned into linked data, but at this time we have only parsed the data that is highlighted on this page. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. Two important things to note here: the first is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. The second is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
  • As an example, Wikipedia has 3000 botanists in their database. We have 10,000 of them. We have the more complete, richer set of data that can be used to

Unlocking Taxonomic Literature II using Linked Open Data Presentation Transcript

  • 1. Joel Richard, Smithsonian Libraries. Unlocking Taxonomic Literature II using Linked Open Data
  • 2. Agenda
      • What is Linked Open Data / The Semantic Web?
      • Where can I see LOD in use?
      • What is Taxonomic Literature II?
      • How is it being converted to LOD?
      • Did we encounter any challenges?
  • 3. What is Linked Open Data? Linked data (from Wikipedia, the free encyclopedia): A method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies … [and] extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried. http://en.wikipedia.org/wiki/Linked_Open_Data
  • 4. What is the Semantic Web? Semantic Web (from Wikipedia, the free encyclopedia): A movement led by the World Wide Web Consortium … to promote common data formats on the Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web dominated by unstructured and semi-structured documents into a "web of data". "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries." http://en.wikipedia.org/wiki/Semantic_Web
  • 5. What is Linked Open Data? Five Stars of Linked Open Data
      ★ Available on the web (in any format) but with an open license, to be Open Data.
      ★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table).
      ★★★ As (2) plus non-proprietary format (e.g. CSV instead of Microsoft Excel).
      ★★★★ All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff.
      ★★★★★ All the above, plus: link your data to other people’s data to provide context.
      http://www.w3.org/DesignIssues/LinkedData.html
  • 6. What is Linked Open Data? Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 7. What is Linked Open Data? [Diagram: a web of triples, each read as Identifier (subject), Predicate (verb/relationship), Identifier / Value (object)]
      Charles Darwin | Born On | “Feb 12, 1809”
      Charles Darwin | Born In | Shrewsbury
      Charles Darwin | Type | Person
      Charles Darwin | Author Of | On the Origin of Species
      Shrewsbury | Type | City
      Shrewsbury | Is In | England
      England | Type | Country
  • 8. What is Linked Open Data? Tim Berners-Lee outlined four principles for linked open data:
      1. Use URIs to denote things.
      2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
      3. Provide useful information about the thing when its URI is dereferenced, leveraging standards such as RDF, SPARQL.
      4. Include links to other related things (using their URIs) when publishing data on the Web.
      http://www.w3.org/DesignIssues/LinkedData.html
      http://5stardata.info/
  • 9. What is Linked Open Data? [Diagram: the same web of triples, with identifiers written as URIs; columns are Identifier, Predicate, Identifier / Value]
      http://dbpedia.org/resource/Charles_Darwin | Born On | “Feb 12, 1809”
      http://dbpedia.org/resource/Charles_Darwin | Born In | http://dbpedia.org/resource/Shrewsbury
      http://dbpedia.org/resource/Charles_Darwin | Type | Person
      http://dbpedia.org/resource/Charles_Darwin | Author Of | http://dbpedia.org/resource/On_the_Origin_of_Species
      http://dbpedia.org/resource/Shrewsbury | Type | City
      http://dbpedia.org/resource/Shrewsbury | Is In | http://dbpedia.org/resource/United_Kingdom
      http://dbpedia.org/resource/United_Kingdom | Type | Country
  • 10. What is Linked Open Data? Predicate Vocabularies
      • Dublin Core – General Metadata for Discovery
      • SKOS – Simple Knowledge Organization System
      • BIBO – Bibliographic Ontology
      • BIO – Biographical
      • FOAF – Friend of a Friend
      • OWL – Web Ontology Language
      • Events…
      • Geographic…
      • Many others!
  • 11. What is Linked Open Data? Mondeca Labs Linked Open Vocabularies (LOV). Vocabulary of a Friend (VOAF): a vocabulary for describing other vocabularies. http://labs.mondeca.com/dataset/lov
  • 12. What is Linked Open Data?
      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .
      @prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
      @prefix dbpprop: <http://dbpedia.org/property/> .
      <http://dbpedia.org/resource/Charles_Darwin>
          rdf:type <http://xmlns.com/foaf/0.1/Person> ;
          rdf:type <http://dbpedia.org/ontology/Scientist> ;
          foaf:name "Charles Darwin" ;
          foaf:depiction "http://upload.wikimedia.org/…/Charles_Darwin_seated_crop.jpg" ;
          dbpedia-owl:field <http://dbpedia.org/resource/Natural_history> ;
          dbpprop:placeOfBirth "Mount House, Shrewsbury, Shropshire, England" ;
          dbpedia-owl:birthDate "1809-02-12" ;
          dbpedia-owl:birthPlace <http://dbpedia.org/resource/Shrewsbury> ;
          dbpedia-owl:deathDate "1882-04-19" ;
          dbpedia-owl:deathPlace <http://dbpedia.org/resource/Down_House> ;
          dbpprop:awards <http://dbpedia.org/resource/Royal_Medal> .
  • 13. What is Linked Open Data? Benefits of Linked Open Data
      • Disambiguation
      • Connecting Relevant Content
      • More visibility via Search
      • Enrichment of your data
      • Easier reuse of data
  • 14. Linked Open Data in Use: Google Knowledge Graph
  • 15. Linked Open Data in Use: Google Knowledge Graph
  • 16. Linked Open Data in Use
  • 17. Other LOD Examples and Information
      • Library of Congress: Linked Data Services http://id.loc.gov/
      • Schema.org http://www.schema.org
      • Data.gov / Semantic http://www.data.gov/semantic
      • Linked Data.org http://linkeddata.org/
      • Stephen Dale: Linked Data in Action http://www.slideshare.net/stephendale/linked-data-in-action-4487244
  • 18. Taxonomic Literature II. Taxonomic Literature: A selective guide to botanical publications and collections with dates, commentaries and types. (Stafleu et al.)
      • Essential Reference Tool for Botanists
      • Authors and their Publications from 1753 to 1940
      • It is a “database in book form.”
  • 19. Taxonomic Literature II
  • 20. Taxonomic Literature II
  • 21. Taxonomic Literature II
  • 22. Taxonomic Literature II
  • 23. Taxonomic Literature II
  • 24. Taxonomic Literature II
      • Scanned the pages.
      • Uploaded to the Internet Archive.
      • Hired contractor for OCR and correction (99.97% accuracy).
      • Received XML dataset from Contractor.
      • Verified and Imported to SQL Server Database.
      • Built a website to search the data.
  • 25. Taxonomic Literature II
  • 26. Taxonomic Literature II. First... what does 99.97% accuracy mean? ~12,000 Errors
  • 27. Taxonomic Literature II
      1. Select Identifiers for our data
         http://library.si.edu/digital-library/tl-2/author/darwin
         http://library.si.edu/digital-library/tl-2/title/origin_of_species
         http://library.si.edu/digital-library/tl-2/title/1313
      2. Choose vocabularies for predicates (harder than it sounds)
         OWL, FOAF, DublinCore, OpenGraph, SIOC, SKOS, BIBO, etc.
      3. Create Links to other data sources on the web.
  • 28. Taxonomic Literature II as Linked Data. Select Identifiers
      <http://library.si.edu/tl2/author/darwin>
          tl2:creator <http://library.si.edu/tl2/title/1313>
          owl:sameAs <http://viaf.org/viaf/27063124>
      <http://library.si.edu/tl2/title/1313>
          dc:creator <http://library.si.edu/tl2/author/darwin>
          owl:sameAs <http://www.archive.org/details/originofspecies00darwuoft>
          owl:sameAs <http://www.worldcat.org/oclc/425919213>
  • 29. Taxonomic Literature II as Linked Data. Select Identifiers: Authors
      <http://library.si.edu/tl2/author/darwin>
          rdf:type <http://xmlns.com/foaf/0.1/Person>
          foaf:lastName "Darwin"
          foaf:familyName "Darwin"
          foaf:firstName "Charles"
          foaf:givenName "Charles"
          foaf:name "Darwin, Charles Robert"
          skos:prefLabel "Darwin, Charles Robert"
          bio:birth "1809"
          bio:death "1882"
          skos:definition "British evolutionary biologist"
          tl2:personAbbreviation "Darwin"
  • 30. Taxonomic Literature II as Linked Data. Select Vocabularies: Publications
      <http://library.si.edu/tl2/book/1313>
          rdf:type <http://purl.org/ontology/bibo/Book>
          tl2:titleNumber "1313"
          tl2:titleAbbreviation "Origin sp."
          tl2:shortTitle "On the origin of species"
          dc:title "On the origin of species by means of natural selection, or the preservation of favoured races in the..."
          dc:publisher "John Murray"
          event:place "London"
          dc:created "1859"
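As a rough illustration of how the author description above could be produced from database fields, here is a minimal Python sketch that emits N-Triples lines. It is not the project's actual code: the function names are hypothetical, only the FOAF, SKOS and RDF predicates are used (the namespace behind the `tl2:` prefix is not given in the slides, so it is left out), and the values are the ones shown for Darwin.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"


def uri(value):
    """Wrap an identifier in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string value, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def author_triples(author_uri, family, given, full_name):
    """Build (subject, predicate, object) triples for one author record."""
    subject = uri(author_uri)
    return [
        (subject, uri(RDF_TYPE), uri(FOAF + "Person")),
        (subject, uri(FOAF + "familyName"), literal(family)),
        (subject, uri(FOAF + "givenName"), literal(given)),
        (subject, uri(FOAF + "name"), literal(full_name)),
        (subject, uri(SKOS + "prefLabel"), literal(full_name)),
    ]


def to_ntriples(triples):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


DARWIN = author_triples(
    "http://library.si.edu/tl2/author/darwin",
    "Darwin",
    "Charles",
    "Darwin, Charles Robert",
)
NT = to_ntriples(DARWIN)
```

Writing out full URIs for every term avoids prefix handling, at the cost of longer lines; a real pipeline would more likely use an RDF library and a prefixed format such as Turtle.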
  • 31. Taxonomic Literature II as Linked Data. Linking: Author Names
      Used a combination of OpenRefine and LODRefine as well as custom code.
      Results: Mixed
      • Matched 15 - 20% of the names in our sample set
      • Some names weren’t high in the list and required a human touch
      Conclusion: Computer code needs to be improved with the aim of minimizing the amount of staff or volunteer time spent matching names.
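The idea of ranking candidate matches and leaving uncertain ones for a person can be sketched as follows. This is not how OpenRefine or LODRefine reconcile names; it swaps in Python's standard `difflib` string similarity purely for illustration, and the candidate list, the 0.75 threshold and the function names are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Turn 'Last, First Middle' into 'first middle last' for comparison."""
    if "," in name:
        last, rest = name.split(",", 1)
        name = f"{rest.strip()} {last.strip()}"
    return " ".join(name.lower().split())


def rank_candidates(tl2_name, candidates):
    """Return (candidate, score) pairs sorted from best to worst match."""
    target = normalize(tl2_name)
    scored = [
        (candidate, SequenceMatcher(None, target, normalize(candidate)).ratio())
        for candidate in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def match_or_review(tl2_name, candidates, threshold=0.75):
    """Accept the top candidate only above the threshold; otherwise flag it."""
    ranked = rank_candidates(tl2_name, candidates)
    if ranked and ranked[0][1] >= threshold:
        return ("matched", ranked[0][0])
    return ("needs review", ranked[0][0] if ranked else None)


# Hypothetical candidate labels, standing in for results from an external source.
RANKED = rank_candidates("Darwin, Charles Robert", ["Erasmus Darwin", "Charles Darwin"])
```

A threshold like this trades automated coverage against wrong links: raising it sends more names to staff or volunteers, lowering it risks linking the wrong Darwin.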
  • 32. Taxonomic Literature II as Linked Data. Charles Darwin (from dbpedia.org)
  • 33. Taxonomic Literature II as Linked Data. Linking: Herbaria
      Used computer code to split the herbarium names and identify them in data provided by the Biodiversity Collections Index.
      Results: Good
      • Matched 95+% of the herbarium names in all of TL-2
      • Careful attention to “A”, which is an herbarium but also starts some sentences in the HERBARIUM and TYPES blocks
      Conclusion: These will be added to TL-2 when it launches as LOD.
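The “A” problem can be shown with a small sketch. The slides do not describe the layout of the HERBARIUM and TYPES blocks or the matching code, so the sample sentence, the tiny code set (a stand-in for Biodiversity Collections Index data) and the heuristic itself are assumptions made for illustration.

```python
import re

# Stand-in for herbarium codes from the Biodiversity Collections Index.
# "MO" appears in the slides; "A" is the ambiguous code the talk mentions.
KNOWN_CODES = {"A", "MO"}


def extract_codes(block, known=KNOWN_CODES):
    """Return known herbarium codes found in a text block, in order.

    Heuristic (an assumption): an "A" directly followed by a lowercase
    word is treated as the English article, not as the herbarium code.
    """
    tokens = re.findall(r"[A-Za-z]+", block)
    found = []
    for i, token in enumerate(tokens):
        if token not in known:
            continue
        next_is_lowercase = i + 1 < len(tokens) and tokens[i + 1][0].islower()
        if token == "A" and next_is_lowercase:
            continue
        found.append(token)
    return found


# Hypothetical block text, written only to exercise both readings of "A".
CODES = extract_codes("MO, A. A few specimens are also at MO.")
```

A rule this simple will still misfire on some sentences, which fits the talk's point that the last few percent of matches need closer attention.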
  • 34. Taxonomic Literature II. Missouri Botanical Garden Herbarium (from the Biodiversity Collections Index)
      Lsid: urn:lsid:biocol.org:col:15859
      Name: Missouri Botanical Garden Herbarium
      Code: MO
      Kind: Herbarium
      Taxon Scope: Herbarium collection limited to vascular plants (5.6 million specimens) and bryophytes (500,000 specimens), Jan. 2009.
      Geo Scope: Worldwide; phanerogams strong in Central America (especially Costa Rica, Nicaragua, and Panama), tropical South America. . .
      Size: 6,150,000
      Founded Year: 1859
      Web Site: http://www.mobot.org/
      Location Street: P.O. Box 299
      Location City: Saint Louis
      Location State: Missouri
      Location Postcode: 63166-0299
      Location Country Iso: US
      http://www.biodiversitycollectionsindex.org/urn:lsid:biocol.org:col:15859
  • 35. Taxonomic Literature II as LOD. How are we going to store all this?
      We’re using Drupal, which can automatically embed some Linked Open Data elements in the webpage.
      Probably not a good idea for very large datasets.
      TL-2 = 10,000 authors + 37,000 titles (about 400,000 triples, but growing)
  • 36. TL-2 and LOD Challenges. Performance of Drupal Import:
      Feeds Import: 7 Hours for 35,000 “Records” or Drupal Nodes
      Other options? Still searching…
      Our linked data set will grow to at least 600-700k Drupal nodes.
      Is Drupal the best way to do this?
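To see why the import time is a concern, the figures above can be extrapolated with a short calculation. The 7 hours, 35,000 records and 600-700k nodes come from the slide; the assumption that import time grows linearly with the number of nodes is mine and may well be optimistic.

```python
def seconds_per_record(sample_hours=7.0, sample_records=35_000):
    """Average import cost per record in the measured run."""
    return sample_hours * 3600 / sample_records


def projected_hours(nodes, sample_hours=7.0, sample_records=35_000):
    """Projected import time, assuming cost scales linearly with node count."""
    return nodes * sample_hours / sample_records


PER_RECORD = seconds_per_record()
LOW_ESTIMATE = projected_hours(600_000)
HIGH_ESTIMATE = projected_hours(700_000)
```

At roughly 0.72 seconds per record, a full load of 600,000 to 700,000 nodes would take on the order of 120 to 140 hours, or five to six days of continuous importing.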
  • 37. Challenges
      • Errors in the Corrected OCR
      • Challenges in Parsing Citations
      • The 80/20 rule: manually making connections unable to be made by automated means
      • Finding suitable sources of data to link to. (DBPedia? VIAF? EOL? Others?)
  • 38. Summary
      • This data may already exist online.
      • It may also not always be as accurate as needed for science.
      • We are in a position to be the authoritative source for this information.
      • Linked Data allows it to be easily reused and shared.
  • 39. Closing: something fun. One example of reuse: Ryan Schenk http://synynyms.com/
  • 40. Closing: something fun. One example of reuse: Ryan Schenk http://synynyms.com/
  • 41. Thank You! Unlocking Taxonomic Literature II using Linked Open Data. Joel Richard, richardjm@si.edu, library.si.edu/staff/joel-richard
      Special thanks to The International Association for Plant Taxonomy, for giving us permission to scan and digitize TL-2 and place it online.
      For his advice and support, Dr. Laurence Dorr, Botanist and Curator, Department of Botany, Smithsonian National Museum of Natural History.
      This project was partially funded by the Atherton Seidell Endowment Fund of the Smithsonian Institution.