The Smithsonian Libraries has digitized Taxonomic Literature II (TL-2), an essential research tool for botanists. This presentation, with audio, starts with a description of Linked Data and a history of TL-2, then covers some of the methods and challenges we are encountering as we convert it to a digital version and Linked Open Data.
https://doi.org/10.6084/m9.figshare.11854626.v1
Presented at the Dutch National Librarian/Information Professional Association annual conference 2011 (NVB2011)
November 17, 2011
2. Agenda
• What is Linked Open Data / The Semantic Web?
• Where can I see LOD in use?
• What is Taxonomic Literature II?
• How is it being converted to LOD?
• Did we encounter any challenges?
3. What is Linked Open Data?
Linked data (from Wikipedia, the free encyclopedia):
"A method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies … [and] extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried."
http://en.wikipedia.org/wiki/Linked_Open_Data
4. What is the Semantic Web?
Semantic Web (from Wikipedia, the free encyclopedia):
"A movement led by the World Wide Web Consortium… to promote common data formats on the Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web, dominated by unstructured and semi-structured documents, into a 'web of data'."
"The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries."
http://en.wikipedia.org/wiki/Semantic_Web
5. Five Stars of Linked Open Data
★ Available on the web (in any format), but with an open license, to be Open Data.
★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table).
★★★ As (2), plus a non-proprietary format (e.g. CSV instead of Microsoft Excel).
★★★★ All the above, plus: use open standards from the W3C (RDF and SPARQL) to identify things, so that people can point at your stuff.
★★★★★ All the above, plus: link your data to other people's data to provide context.
http://www.w3.org/DesignIssues/LinkedData.html
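The jump from two stars to three can be sketched in a few lines: the same record exported as non-proprietary CSV rather than a spreadsheet file. The record and field names below are illustrative, not from TL-2 itself.

```python
import csv
import io

# A record as it might sit in a spreadsheet (illustrative data).
rows = [
    {"name": "Darwin, Charles Robert", "born": "1809", "died": "1882"},
]

# Three-star data: structured AND non-proprietary (CSV instead of Excel).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "born", "died"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Four and five stars then replace the bare strings with URIs and links into other people's data, as the later slides show.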
6. What is Linked Open Data?
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
7. What is Linked Open Data?
A triple has the form: Identifier (subject), Predicate (verb/relationship), Identifier/Value (object).
The diagram links such triples into a web of data:
Charles Darwin, BornOn, "Feb 12, 1809"
Charles Darwin, Born In, Shrewsbury
Charles Darwin, Type, Person
Charles Darwin, Author Of, On the Origin of Species
Shrewsbury, Type, City
Shrewsbury, Is In, England
England, Type, Country
8. What is Linked Open Data?
Tim Berners-Lee outlined four principles for linked open data:
1. Use URIs to denote things.
2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
3. Provide useful information about the thing when its URI is dereferenced, leveraging standards such as RDF and SPARQL.
4. Include links to other related things (using their URIs) when publishing data on the Web.
http://www.w3.org/DesignIssues/LinkedData.html
http://5stardata.info/
9. What is Linked Open Data?
The same web of data, with the identifiers now expressed as dereferenceable URIs (Identifier, Predicate, Identifier/Value):
http://dbpedia.org/resource/Charles_Darwin, BornOn, "Feb 12, 1809"
http://dbpedia.org/resource/Charles_Darwin, Born In, http://dbpedia.org/resource/Shrewsbury
http://dbpedia.org/resource/Charles_Darwin, Type, Person
http://dbpedia.org/resource/Charles_Darwin, Author Of, http://dbpedia.org/resource/On_the_Origin_of_Species
http://dbpedia.org/resource/Shrewsbury, Type, City
http://dbpedia.org/resource/Shrewsbury, Is In, http://dbpedia.org/resource/United_Kingdom
http://dbpedia.org/resource/United_Kingdom, Type, Country
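The triples on this slide need nothing fancier than three-element tuples to hold them. A minimal sketch of an in-memory store with a SPARQL-style wildcard match follows; this is illustrative code, not any real RDF library.

```python
# Minimal triple store: each fact is a (subject, predicate, object) tuple.
DBP = "http://dbpedia.org/resource/"

triples = {
    (DBP + "Charles_Darwin", "BornOn", "Feb 12, 1809"),
    (DBP + "Charles_Darwin", "BornIn", DBP + "Shrewsbury"),
    (DBP + "Charles_Darwin", "Type", "Person"),
    (DBP + "Charles_Darwin", "AuthorOf", DBP + "On_the_Origin_of_Species"),
    (DBP + "Shrewsbury", "Type", "City"),
    (DBP + "Shrewsbury", "IsIn", DBP + "United_Kingdom"),
    (DBP + "United_Kingdom", "Type", "Country"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None is a wildcard,
    much like a variable in a SPARQL basic graph pattern."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# "What did Darwin write?" is one hop through the graph.
works = match(s=DBP + "Charles_Darwin", p="AuthorOf")
```

Chaining such matches (Darwin, BornIn, ?x; ?x, IsIn, ?y) is exactly the kind of query that SPARQL expresses declaratively over real linked data.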
10. What is Linked Open Data?
Predicate Vocabularies
• Dublin Core – General Metadata for Discovery
• SKOS – Simple Knowledge Organization System
• BIBO – Bibliographic Ontology
• BIO – Biographical
• FOAF – Friend of a Friend
• Events…
• Geographic…
• Many others!
• OWL – Web Ontology Language
11. What is Linked Open Data?
Mondeca Labs: Linked Open Vocabularies (LOV)
Vocabulary of a Friend (VOAF): a vocabulary for describing other vocabularies
http://labs.mondeca.com/dataset/lov
13. What is Linked Open Data?
Benefits of Linked Open Data
• Disambiguation
• Connecting Relevant Content
• More visibility via Search
• Enrichment of your data
• Easier reuse of data
17. Other LOD Examples and Information
Library of Congress: Linked Data Services
http://id.loc.gov/
Schema.org
http://www.schema.org
Data.gov / Semantic
http://www.data.gov/semantic
LinkedData.org
http://linkeddata.org/
Stephen Dale: Linked Data in Action
http://www.slideshare.net/stephendale/linked-data-in-action-4487244
18. Taxonomic Literature II
Taxonomic Literature: A selective guide to botanical publications and collections with dates, commentaries and types. (Stafleu et al.)
Essential reference tool for botanists
Authors and their publications from 1753 to 1940
It is a "database in book form."
24. Taxonomic Literature II
Scanned the pages.
Uploaded to the Internet Archive.
Hired a contractor for OCR and correction (99.97% accuracy).
Received an XML dataset from the contractor.
Verified and imported into a SQL Server database.
Built a website to search the data.
27. Taxonomic Literature II
1. Select identifiers for our data:
http://library.si.edu/digital-library/tl-2/author/darwin
http://library.si.edu/digital-library/tl-2/title/origin_of_species
http://library.si.edu/digital-library/tl-2/title/1313
2. Choose vocabularies for predicates (harder than it sounds):
OWL, FOAF, Dublin Core, OpenGraph, SIOC, SKOS, BIBO, etc.
3. Create links to other data sources on the web.
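Step 1 above mostly comes down to deriving a stable, URL-safe slug from each author name or title. A sketch of the kind of helper involved; the function name and slug rule are assumptions for illustration, not the project's actual code.

```python
import re

# Base URI as shown on the slide above.
BASE = "http://library.si.edu/digital-library/tl-2"

def mint_uri(kind, label):
    """Derive a stable, lowercase, URL-safe identifier from a label.
    Runs of non-alphanumeric characters collapse to underscores."""
    slug = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    return f"{BASE}/{kind}/{slug}"

author_uri = mint_uri("author", "Darwin")
title_uri = mint_uri("title", "Origin of Species")
```

The important property is stability: once published, the URI must keep resolving to the same author or title, which is why slugs beat database row IDs for human-facing identifiers (though TL-2 also keeps numeric title URIs like /title/1313).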
28. Taxonomic Literature II as Linked Data
Select Identifiers
<http://library.si.edu/tl2/author/darwin>
  tl2:creator <http://library.si.edu/tl2/title/1313>
  owl:sameAs <http://viaf.org/viaf/27063124>
<http://library.si.edu/tl2/title/1313>
  dc:creator <http://library.si.edu/tl2/author/darwin>
  owl:sameAs <http://www.archive.org/details/originofspecies00darwuoft>
  owl:sameAs <http://www.worldcat.org/oclc/425919213>
29. Taxonomic Literature II as Linked Data
Select Identifiers: Authors
<http://library.si.edu/tl2/author/darwin>
  rdf:type <http://xmlns.com/foaf/0.1/Person>
  foaf:lastName "Darwin"
  foaf:familyName "Darwin"
  foaf:firstName "Charles"
  foaf:givenName "Charles"
  foaf:name "Darwin, Charles Robert"
  skos:prefLabel "Darwin, Charles Robert"
  bio:birth "1809"
  bio:death "1882"
  skos:definition "British evolutionary biologist"
  tl2:personAbbreviation "Darwin"
30. Taxonomic Literature II as Linked Data
Select Vocabularies: Publications
<http://library.si.edu/tl2/book/1313>
  rdf:type <http://purl.org/ontology/bibo/Book>
  tl2:titleNumber "1313"
  tl2:titleAbbreviation "Origin sp."
  tl2:shortTitle "On the origin of species"
  dc:title "On the origin of species by means of natural selection, or the preservation of favoured races in the..."
  dc:publisher "John Murray"
  event:place "London"
  dc:created "1859"
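Once a record lives in the database, property lists like the author and publication examples above can be emitted mechanically. A hedged sketch of rendering a record as Turtle-style triple lines; the `to_triples` helper and its URI-vs-literal rule are illustrative, not the project's actual export code.

```python
def to_triples(subject_uri, properties):
    """Render a {predicate: value} mapping as Turtle-style lines.
    Values starting with 'http' become URI objects; others are literals."""
    lines = [f"<{subject_uri}>"]
    items = list(properties.items())
    for i, (pred, val) in enumerate(items):
        obj = f"<{val}>" if str(val).startswith("http") else f'"{val}"'
        sep = " ." if i == len(items) - 1 else " ;"  # Turtle terminators
        lines.append(f"  {pred} {obj}{sep}")
    return "\n".join(lines)

darwin = {
    "rdf:type": "http://xmlns.com/foaf/0.1/Person",
    "foaf:name": "Darwin, Charles Robert",
    "bio:birth": "1809",
    "bio:death": "1882",
}
print(to_triples("http://library.si.edu/tl2/author/darwin", darwin))
```

Real exports would also need prefix declarations and datatype handling, but the shape of the output matches the slides: one subject, then a predicate-object list.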
31. Taxonomic Literature II as Linked Data
Linking: Author Names
Used a combination of OpenRefine and LODRefine as well as custom code.
Results: Mixed
• Matched 15-20% of the names in our sample set
• Some names weren't high in the list and required a human touch
Conclusion: The computer code needs to be improved, with the aim of minimizing the amount of staff or volunteer time spent matching names.
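The approximate name matching described above can be prototyped with nothing beyond the standard library: difflib's get_close_matches returns candidates ranked by similarity, which is roughly the list a human reviewer would then confirm. The authority names below are illustrative stand-ins for a real VIAF or DBpedia candidate set.

```python
from difflib import get_close_matches

# Candidate authority names, e.g. pulled from VIAF or DBpedia (illustrative).
authority = [
    "Darwin, Charles Robert",
    "Darwin, Erasmus",
    "De Candolle, Augustin Pyramus",
    "Hooker, Joseph Dalton",
]

# A TL-2 name we want to reconcile; cutoff trades recall for precision.
candidates = get_close_matches("Darwin, Charles", authority, n=3, cutoff=0.6)
```

Tuning the cutoff is exactly the mixed-results trade-off the slide describes: too low and reviewers wade through noise, too high and true matches sit outside the list.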
33. Taxonomic Literature II as Linked Data
Linking: Herbaria
Used computer code to split the herbarium names and identify them in data provided by the Biodiversity Collections Index.
Results: Good
• Matched 95+% of the herbarium names in all of TL-2
• Careful attention to "A", which is an herbarium code but also starts some sentences in the HERBARIUM and TYPES blocks
Conclusion: These will be added to TL-2 when it launches as LOD.
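A sketch of the splitting-and-lookup step, including a guard for the ambiguous "A". The tiny index and the heuristic are simplifications for illustration; the real work matched against the full Biodiversity Collections Index.

```python
import re

# Tiny stand-in for the Biodiversity Collections Index lookup (illustrative).
bci = {
    "MO": "Missouri Botanical Garden Herbarium",
    "K": "Royal Botanic Gardens, Kew",
    "A": "Arnold Arboretum",
}

def find_herbaria(text):
    """Split a HERBARIUM block on commas/semicolons and keep known codes.
    Heuristic: 'A' at the start of a sentence is prose, not a code."""
    found = []
    for token in re.split(r"[,;]\s*", text.strip()):
        code = token.strip(" .")
        if code == "A" and text.strip().startswith("A "):
            continue  # 'A' opening a sentence, not the herbarium code
        if code in bci:
            found.append(code)
    return found

codes = find_herbaria("MO, K, A.")
```

Because most herbarium codes are unambiguous tokens, even a crude splitter like this gets the reported 95+% hit rate; "A" is the notable exception that needed special handling.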
34. Taxonomic Literature II
Missouri Botanical Garden Herbarium (from the Biodiversity Collections Index)
LSID: urn:lsid:biocol.org:col:15859
Name: Missouri Botanical Garden Herbarium
Code: MO
Kind: Herbarium
Taxon Scope: Herbarium collection limited to vascular plants (5.6 million specimens) and bryophytes (500,000 specimens), Jan. 2009.
Geo Scope: Worldwide; phanerogams strong in Central America (especially Costa Rica, Nicaragua, and Panama), tropical South America. . .
Size: 6,150,000
Founded Year: 1859
Web Site: http://www.mobot.org/
Location Street: P.O. Box 299
Location City: Saint Louis
Location State: Missouri
Location Postcode: 63166-0299
Location Country ISO: US
http://www.biodiversitycollectionsindex.org/urn:lsid:biocol.org:col:15859
35. Taxonomic Literature II as LOD
How are we going to store all this?
We're using Drupal, which automatically embeds some Linked Open Data elements in the webpage.
Probably not a good idea for very large datasets.
TL-2 = 10,000 authors + 37,000 titles
(about 400,000 triples, but growing)
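The back-of-the-envelope behind that figure: roughly 47,000 records at eight or nine triples each lands in the stated range. The per-record triple count is an assumption based on the Darwin and Origin property lists on the earlier slides.

```python
authors, titles = 10_000, 37_000
records = authors + titles  # 47,000 subjects in total

# Each record carries on the order of 8-9 predicates (see the author and
# publication examples on slides 29-30), giving roughly 400k triples.
low, high = records * 8, records * 9
```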
36. TL-2 and LOD Challenges
Performance of Drupal import:
Feeds Import: 7 hours for 35,000 "records", or Drupal nodes
Other options? Still searching…
Our linked data set will grow to at least 600-700k Drupal nodes.
Is Drupal the best way to do this?
37. Challenges
• Errors in the corrected OCR
• Challenges in parsing citations
• The 80/20 rule: manually making connections that cannot be made by automated means
• Finding suitable sources of data to link to (DBpedia? VIAF? EOL? Others?)
38. Summary
• This data may already exist online.
• It may also not always be as accurate as needed for science.
• We are in a position to be the authoritative source for this information.
• Linked Data allows it to be easily reused and shared.
41. Thank You!
Unlocking Taxonomic Literature II using Linked Open Data
Joel Richard
richardjm@si.edu
library.si.edu/staff/joel-richard
Special thanks to:
The International Association for Plant Taxonomy, for giving us permission to scan and digitize TL-2 and place it online.
For his advice and support, Dr. Laurence Dorr, Botanist and Curator, Department of Botany, Smithsonian National Museum of Natural History.
This project was partially funded by the Atherton Seidell Endowment Fund of the Smithsonian Institution.
Editor's Notes
This is a quick demonstration of how linked data has grown over the past five years. Back in 2007 we had only a handful of data sets, at least according to Richard Cyganiak's searching. Between 2009 and 2010 the number of items doubled. As of Sept 2011 there are 295 data sets listed. There are more today and more being added every day. Not all data sets are represented here, so this is only a sample of what's available. The actual graph could be four or five times larger by now. What's the point? This is all data that has the potential to enhance YOUR data. This is all linked data. This is all open data.
The basic unit of LOD is the "triple", made up of three elements: an identifier, a predicate, and another identifier or a value of some kind. Think of it as a sentence: subject-verb-object. The underlined blue text indicates an identifier that can be linked to on the web. The first part of the triple is always an identifier. The third part is sometimes an identifier, but should be if an identifier exists. When we repeat these connections, we start to create a web of networked data.
Looking back, we can see that Tim Berners-Lee has mapped out these four principles that make up the foundation of linked data, which also give it structure and make it easy to use.
Going back to our web of data, we can now represent the identifiers as actual URIs. The next question is: where do we get the predicates from? Why are they important?
There are numerous vocabularies of predicates that we can use when developing our linked open data. (Describe them more in detail, leading into the next slide)
Wow, look at all of them! Mondeca Labs has collected and classified all the vocabularies they can find. There are 350 vocabularies listed here.
Here is an example of some linked data in a reasonably human-readable form. We have some prefix definitions of the predicate vocabularies we are using. Then we have the identifier in green, and the predicates in blue. Values are in black with identifiers enclosed in greater-than and less-than signs.
What are the benefits of LOD?
Example of LOD in action. Google's knowledge graph knows that Darwin is a person and that Shrewsbury is a place, allowing it to offer different, more specialized results in your search. As LOD becomes available, your data may be used to enhance these results. Google is also able to help disambiguate common terms, such as "Lafayette" (the college, various U.S. cities, or the Marquis de Lafayette).
http://google.com/
Here are some more examples of places you can go for linked data. The Library of Congress has linked data services for their authorities and vocabularies. Schema.org is being used within webpages to improve their visibility and search results. The US Government is offering a lot of data, some of it as linked data. LinkedData.org is a place to go to learn about all things linked data. Finally, Stephen Dale, a knowledge management consultant, has a great presentation with examples of linked data in use.
Overall, TL-2 provides the most comprehensive biographical and bibliographical analysis for systematic botany literature published between 1753 and 1940 to date.
Here is a page from TL-2. It’s hard to read. Let’s zoom in a bit.
When we’ve zoomed in, we can see Darwin’s name, description, birth and death dates, and an abbreviation in parenthesis. We also have herbaria (libraries of plant samples) that he contributed to, and a brief note about his significance and how his works are greater than that which can be contained by TL-2.
Continuing our zooming… This includes some additional information that we know about Charles Darwin, including places where we can find known samples of his handwriting, species that were named for him and even postage stamps that honor him.
Continuing our zooming… Here we see three publications by Darwin, giving the number of the book, the title, and publication information.
The things that make TL-2 important are the unique abbreviations of the author names, e.g. “Darwin,” outlined in green. Also significant are the abbreviations of the publication titles, also outlined in green (“Origin sp.”), though not all publications have abbreviated titles. In red are the book numbers, also unique across all 37,000 publications. Finally, we have the “short title” of the volumes, which is outlined in blue.
Briefly, this was our process to create the data. In January 2011, we scanned the books and placed them online at the Internet Archive. Later, after selecting a contractor, we sent the scans and the OCR text (created at the Internet Archive) to the contractor, who ultimately created a 99.97% accurate text version of TL-2. They then parsed that data to a limited degree and delivered to us an XML dataset that we imported into a SQL Server database. Finally, we created a searchable, browsable website to access the TL-2 data, opening it up to researchers around the world. Two of them use it on a regular basis. (rimshot!) In reality, in a month we get about 500 visitors and 6,000 pageviews, with about 60% of those coming from outside of the U.S.
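The XML-to-database import step can be sketched in a few lines. This is a stand-in, not the actual schema: SQLite replaces SQL Server, and the element and column names are invented for illustration:

```python
# Sketch of the import step: parse a hypothetical fragment of delivered XML
# and load it into a relational table. SQLite stands in for SQL Server, and
# the element/attribute names are invented, not the contractor's real schema.
import sqlite3
import xml.etree.ElementTree as ET

xml_data = """
<tl2>
  <title number="1313" author="Darwin">
    <shortTitle>On the origin of species</shortTitle>
  </title>
</tl2>
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles (number INTEGER PRIMARY KEY, author TEXT, short_title TEXT)"
)

for el in ET.fromstring(xml_data).iter("title"):
    conn.execute(
        "INSERT INTO titles VALUES (?, ?, ?)",
        (int(el.get("number")), el.get("author"), el.findtext("shortTitle")),
    )

row = conn.execute("SELECT * FROM titles WHERE number = 1313").fetchone()
print(row)  # (1313, 'Darwin', 'On the origin of species')
```

The real pipeline would run one pass like this over all 37,000 titles and 9,900 authors, which is why the parse accuracy discussed next matters so much.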
This is our current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. This page got approximately 860 visitors and 1,500 visits in the month of April 2013, which is twice the number of visitors we got in April 2012. We actually get more visits from Europe than from North America. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
Earlier we mentioned 99.97% accuracy. This means that if we assume 38 million characters in all of TL-2, there are upwards of 12,000 errors in our text. (In reality this is more like 5,000-6,000 due to the nature of our data.) This may not be bad for the textual components of the content, but when it comes to parsing citations or more structured information, it will prove to be a challenge. Other datasets may not have this problem, but as we are scanning and converting to text, this is something that will always be present for us.
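The arithmetic behind that estimate is simple enough to check directly:

```python
# Back-of-the-envelope check of the error estimate quoted above:
# 38 million characters at 99.97% accuracy.
total_chars = 38_000_000
accuracy = 0.9997

errors = total_chars * (1 - accuracy)
print(int(round(errors)))  # 11400 -- i.e. "upwards of 12,000 errors"
```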
This is a page of TL-2 showing Charles Darwin and On the Origin of Species, with those items highlighted that are immediately visible and can be parsed and turned into Linked Data. There is other data on the page that could be turned into linked data, but at this time, we have only parsed the data that is highlighted here. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. One important thing to note here is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. Another important item is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
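Extracting those two identifiers is a typical pattern-matching job. A sketch with regular expressions; the sample strings are simplified stand-ins, not actual TL-2 entry text:

```python
# Sketch: pulling the two key identifiers out of TL-2 entry lines with
# regular expressions. The sample strings below are simplified for the
# example and do not reproduce real TL-2 formatting.
import re

author_line = "DARWIN, Charles Robert (1809-1882), British naturalist. (Darwin)."
title_line = "1313. On the origin of species by means of natural selection."

# The author abbreviation is the parenthesized, capitalized token at the end.
abbrev = re.search(r"\(([A-Z][A-Za-z. ]+)\)\.?$", author_line)

# The title number is the leading integer before the first period.
number = re.match(r"(\d+)\.\s+(.*)", title_line)

print(abbrev.group(1))  # Darwin
print(number.group(1))  # 1313
```

With 99.97% OCR accuracy, patterns like these will occasionally fail on garbled characters, which is why parsing structured fields is harder than recovering running text.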
As an example, Wikipedia has 3,000 botanists in their database. We have 10,000 of them. We have the more complete, richer set of data that can be used to