The Smithsonian Libraries has digitized Taxonomic Literature II (TL-2), an essential research tool for botanists. This presentation, with audio, starts with a description of Linked Data and a history of TL-2, then covers some of the methods and challenges we are encountering as we convert it to a digital version and Linked Open Data.
https://doi.org/10.6084/m9.figshare.11854626.v1
Presented at the Dutch National Librarian/Information Professional Association annual conference 2011 (NVB2011)
November 17, 2011
2. Agenda
• What is Linked Open Data / The Semantic Web?
• Where can I see LOD in use?
• What is Taxonomic Literature II?
• How is it being converted to LOD?
• Did we encounter any challenges?
3. What is Linked Open Data?
Linked data (from Wikipedia, the free encyclopedia):
"A method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies … [and] extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried."
http://en.wikipedia.org/wiki/Linked_Open_Data
4. What is the Semantic Web?
Semantic Web (from Wikipedia, the free encyclopedia):
"A movement led by the World Wide Web Consortium… to promote common data formats on the Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web, dominated by unstructured and semi-structured documents, into a 'web of data'."
"The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries."
http://en.wikipedia.org/wiki/Semantic_Web
5. Five Stars of Linked Open Data
★ Available on the web (in any format), but with an open license, to be Open Data.
★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table).
★★★ As (2), plus a non-proprietary format (e.g. CSV instead of Microsoft Excel).
★★★★ All the above, plus: use open standards from the W3C (RDF and SPARQL) to identify things, so that people can point at your stuff.
★★★★★ All the above, plus: link your data to other people's data to provide context.
http://www.w3.org/DesignIssues/LinkedData.html
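The jump from two stars to three can be sketched in a few lines: the same record exported as non-proprietary CSV rather than a spreadsheet file. The record and field names below are illustrative, not from TL-2 itself.

```python
import csv
import io

# A record as it might sit in a spreadsheet (illustrative data).
rows = [
    {"name": "Darwin, Charles Robert", "born": "1809", "died": "1882"},
]

# Three-star data: structured AND non-proprietary (CSV instead of Excel).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "born", "died"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Four and five stars then replace the bare strings with URIs and links into other people's data, as the later slides show.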
6. What is Linked Open Data?
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
7. What is Linked Open Data?
A triple has the form: Identifier (subject), Predicate (verb/relationship), Identifier/Value (object).
The diagram links such triples into a web of data:
Charles Darwin, BornOn, "Feb 12, 1809"
Charles Darwin, Born In, Shrewsbury
Charles Darwin, Type, Person
Charles Darwin, Author Of, On the Origin of Species
Shrewsbury, Type, City
Shrewsbury, Is In, England
England, Type, Country
8. What is Linked Open Data?
Tim Berners-Lee outlined four principles for linked open data:
1. Use URIs to denote things.
2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
3. Provide useful information about the thing when its URI is dereferenced, leveraging standards such as RDF and SPARQL.
4. Include links to other related things (using their URIs) when publishing data on the Web.
http://www.w3.org/DesignIssues/LinkedData.html
http://5stardata.info/
9. What is Linked Open Data?
The same web of data, with the identifiers now expressed as dereferenceable URIs (Identifier, Predicate, Identifier/Value):
http://dbpedia.org/resource/Charles_Darwin, BornOn, "Feb 12, 1809"
http://dbpedia.org/resource/Charles_Darwin, Born In, http://dbpedia.org/resource/Shrewsbury
http://dbpedia.org/resource/Charles_Darwin, Type, Person
http://dbpedia.org/resource/Charles_Darwin, Author Of, http://dbpedia.org/resource/On_the_Origin_of_Species
http://dbpedia.org/resource/Shrewsbury, Type, City
http://dbpedia.org/resource/Shrewsbury, Is In, http://dbpedia.org/resource/United_Kingdom
http://dbpedia.org/resource/United_Kingdom, Type, Country
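The triples on this slide need nothing fancier than three-element tuples to hold them. A minimal sketch of an in-memory store with a SPARQL-style wildcard match follows; this is illustrative code, not any real RDF library.

```python
# Minimal triple store: each fact is a (subject, predicate, object) tuple.
DBP = "http://dbpedia.org/resource/"

triples = {
    (DBP + "Charles_Darwin", "BornOn", "Feb 12, 1809"),
    (DBP + "Charles_Darwin", "BornIn", DBP + "Shrewsbury"),
    (DBP + "Charles_Darwin", "Type", "Person"),
    (DBP + "Charles_Darwin", "AuthorOf", DBP + "On_the_Origin_of_Species"),
    (DBP + "Shrewsbury", "Type", "City"),
    (DBP + "Shrewsbury", "IsIn", DBP + "United_Kingdom"),
    (DBP + "United_Kingdom", "Type", "Country"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None is a wildcard,
    much like a variable in a SPARQL basic graph pattern."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# "What did Darwin write?" is one hop through the graph.
works = match(s=DBP + "Charles_Darwin", p="AuthorOf")
```

Chaining such matches (Darwin, BornIn, ?x; ?x, IsIn, ?y) is exactly the kind of query that SPARQL expresses declaratively over real linked data.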
10. What is Linked Open Data?
Predicate Vocabularies
• Dublin Core – General Metadata for Discovery
• SKOS – Simple Knowledge Organization System
• BIBO – Bibliographic Ontology
• BIO – Biographical
• FOAF – Friend of a Friend
• Events…
• Geographic…
• Many others!
• OWL – Web Ontology Language
11. What is Linked Open Data?
Mondeca Labs: Linked Open Vocabularies (LOV)
Vocabulary of a Friend (VOAF): a vocabulary for describing other vocabularies
http://labs.mondeca.com/dataset/lov
13. What is Linked Open Data?
Benefits of Linked Open Data
• Disambiguation
• Connecting Relevant Content
• More visibility via Search
• Enrichment of your data
• Easier reuse of data
17. Other LOD Examples and Information
Library of Congress: Linked Data Services
http://id.loc.gov/
Schema.org
http://www.schema.org
Data.gov / Semantic
http://www.data.gov/semantic
LinkedData.org
http://linkeddata.org/
Stephen Dale: Linked Data in Action
http://www.slideshare.net/stephendale/linked-data-in-action-4487244
18. Taxonomic Literature II
Taxonomic Literature: A selective guide to botanical publications and collections with dates, commentaries and types. (Stafleu et al.)
Essential reference tool for botanists
Authors and their publications from 1753 to 1940
It is a "database in book form."
24. Taxonomic Literature II
Scanned the pages.
Uploaded to the Internet Archive.
Hired a contractor for OCR and correction (99.97% accuracy).
Received an XML dataset from the contractor.
Verified and imported into a SQL Server database.
Built a website to search the data.
27. Taxonomic Literature II
1. Select identifiers for our data:
http://library.si.edu/digital-library/tl-2/author/darwin
http://library.si.edu/digital-library/tl-2/title/origin_of_species
http://library.si.edu/digital-library/tl-2/title/1313
2. Choose vocabularies for predicates (harder than it sounds):
OWL, FOAF, Dublin Core, OpenGraph, SIOC, SKOS, BIBO, etc.
3. Create links to other data sources on the web.
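Step 1 above mostly comes down to deriving a stable, URL-safe slug from each author name or title. A sketch of the kind of helper involved; the function name and slug rule are assumptions for illustration, not the project's actual code.

```python
import re

# Base URI as shown on the slide above.
BASE = "http://library.si.edu/digital-library/tl-2"

def mint_uri(kind, label):
    """Derive a stable, lowercase, URL-safe identifier from a label.
    Runs of non-alphanumeric characters collapse to underscores."""
    slug = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    return f"{BASE}/{kind}/{slug}"

author_uri = mint_uri("author", "Darwin")
title_uri = mint_uri("title", "Origin of Species")
```

The important property is stability: once published, the URI must keep resolving to the same author or title, which is why slugs beat database row IDs for human-facing identifiers (though TL-2 also keeps numeric title URIs like /title/1313).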
28. Taxonomic Literature II as Linked Data
Select Identifiers
<http://library.si.edu/tl2/author/darwin>
  tl2:creator <http://library.si.edu/tl2/title/1313>
  owl:sameAs <http://viaf.org/viaf/27063124>
<http://library.si.edu/tl2/title/1313>
  dc:creator <http://library.si.edu/tl2/author/darwin>
  owl:sameAs <http://www.archive.org/details/originofspecies00darwuoft>
  owl:sameAs <http://www.worldcat.org/oclc/425919213>
29. Taxonomic Literature II as Linked Data
Select Identifiers: Authors
<http://library.si.edu/tl2/author/darwin>
  rdf:type <http://xmlns.com/foaf/0.1/Person>
  foaf:lastName "Darwin"
  foaf:familyName "Darwin"
  foaf:firstName "Charles"
  foaf:givenName "Charles"
  foaf:name "Darwin, Charles Robert"
  skos:prefLabel "Darwin, Charles Robert"
  bio:birth "1809"
  bio:death "1882"
  skos:definition "British evolutionary biologist"
  tl2:personAbbreviation "Darwin"
30. Taxonomic Literature II as Linked Data
Select Vocabularies: Publications
<http://library.si.edu/tl2/book/1313>
  rdf:type <http://purl.org/ontology/bibo/Book>
  tl2:titleNumber "1313"
  tl2:titleAbbreviation "Origin sp."
  tl2:shortTitle "On the origin of species"
  dc:title "On the origin of species by means of natural selection, or the preservation of favoured races in the..."
  dc:publisher "John Murray"
  event:place "London"
  dc:created "1859"
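Once a record lives in the database, property lists like the author and publication examples above can be emitted mechanically. A hedged sketch of rendering a record as Turtle-style triple lines; the `to_triples` helper and its URI-vs-literal rule are illustrative, not the project's actual export code.

```python
def to_triples(subject_uri, properties):
    """Render a {predicate: value} mapping as Turtle-style lines.
    Values starting with 'http' become URI objects; others are literals."""
    lines = [f"<{subject_uri}>"]
    items = list(properties.items())
    for i, (pred, val) in enumerate(items):
        obj = f"<{val}>" if str(val).startswith("http") else f'"{val}"'
        sep = " ." if i == len(items) - 1 else " ;"  # Turtle terminators
        lines.append(f"  {pred} {obj}{sep}")
    return "\n".join(lines)

darwin = {
    "rdf:type": "http://xmlns.com/foaf/0.1/Person",
    "foaf:name": "Darwin, Charles Robert",
    "bio:birth": "1809",
    "bio:death": "1882",
}
print(to_triples("http://library.si.edu/tl2/author/darwin", darwin))
```

Real exports would also need prefix declarations and datatype handling, but the shape of the output matches the slides: one subject, then a predicate-object list.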
31. Taxonomic Literature II as Linked Data
Linking: Author Names
Used a combination of OpenRefine and LODRefine as well as custom code.
Results: Mixed
• Matched 15-20% of the names in our sample set
• Some names weren't high in the list and required a human touch
Conclusion: The computer code needs to be improved, with the aim of minimizing the amount of staff or volunteer time spent matching names.
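The approximate name matching described above can be prototyped with nothing beyond the standard library: difflib's get_close_matches returns candidates ranked by similarity, which is roughly the list a human reviewer would then confirm. The authority names below are illustrative stand-ins for a real VIAF or DBpedia candidate set.

```python
from difflib import get_close_matches

# Candidate authority names, e.g. pulled from VIAF or DBpedia (illustrative).
authority = [
    "Darwin, Charles Robert",
    "Darwin, Erasmus",
    "De Candolle, Augustin Pyramus",
    "Hooker, Joseph Dalton",
]

# A TL-2 name we want to reconcile; cutoff trades recall for precision.
candidates = get_close_matches("Darwin, Charles", authority, n=3, cutoff=0.6)
```

Tuning the cutoff is exactly the mixed-results trade-off the slide describes: too low and reviewers wade through noise, too high and true matches sit outside the list.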
33. Taxonomic Literature II as Linked Data
Linking: Herbaria
Used computer code to split the herbarium names and identify them in data provided by the Biodiversity Collections Index.
Results: Good
• Matched 95+% of the herbarium names in all of TL-2
• Careful attention to "A", which is an herbarium code but also starts some sentences in the HERBARIUM and TYPES blocks
Conclusion: These will be added to TL-2 when it launches as LOD.
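A sketch of the splitting-and-lookup step, including a guard for the ambiguous "A". The tiny index and the heuristic are simplifications for illustration; the real work matched against the full Biodiversity Collections Index.

```python
import re

# Tiny stand-in for the Biodiversity Collections Index lookup (illustrative).
bci = {
    "MO": "Missouri Botanical Garden Herbarium",
    "K": "Royal Botanic Gardens, Kew",
    "A": "Arnold Arboretum",
}

def find_herbaria(text):
    """Split a HERBARIUM block on commas/semicolons and keep known codes.
    Heuristic: 'A' at the start of a sentence is prose, not a code."""
    found = []
    for token in re.split(r"[,;]\s*", text.strip()):
        code = token.strip(" .")
        if code == "A" and text.strip().startswith("A "):
            continue  # 'A' opening a sentence, not the herbarium code
        if code in bci:
            found.append(code)
    return found

codes = find_herbaria("MO, K, A.")
```

Because most herbarium codes are unambiguous tokens, even a crude splitter like this gets the reported 95+% hit rate; "A" is the notable exception that needed special handling.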
34. Taxonomic Literature II
Missouri Botanical Garden Herbarium (from the Biodiversity Collections Index)
LSID: urn:lsid:biocol.org:col:15859
Name: Missouri Botanical Garden Herbarium
Code: MO
Kind: Herbarium
Taxon Scope: Herbarium collection limited to vascular plants (5.6 million specimens) and bryophytes (500,000 specimens), Jan. 2009.
Geo Scope: Worldwide; phanerogams strong in Central America (especially Costa Rica, Nicaragua, and Panama), tropical South America. . .
Size: 6,150,000
Founded Year: 1859
Web Site: http://www.mobot.org/
Location Street: P.O. Box 299
Location City: Saint Louis
Location State: Missouri
Location Postcode: 63166-0299
Location Country ISO: US
http://www.biodiversitycollectionsindex.org/urn:lsid:biocol.org:col:15859
35. Taxonomic Literature II as LOD
How are we going to store all this?
We're using Drupal, which automatically embeds some Linked Open Data elements in the webpage.
Probably not a good idea for very large datasets.
TL-2 = 10,000 authors + 37,000 titles
(about 400,000 triples, but growing)
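The back-of-the-envelope behind that figure: roughly 47,000 records at eight or nine triples each lands in the stated range. The per-record triple count is an assumption based on the Darwin and Origin property lists on the earlier slides.

```python
authors, titles = 10_000, 37_000
records = authors + titles  # 47,000 subjects in total

# Each record carries on the order of 8-9 predicates (see the author and
# publication examples on slides 29-30), giving roughly 400k triples.
low, high = records * 8, records * 9
```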
36. TL-2 and LOD Challenges
Performance of Drupal import:
Feeds Import: 7 hours for 35,000 "records", or Drupal nodes
Other options? Still searching…
Our linked data set will grow to at least 600-700k Drupal nodes.
Is Drupal the best way to do this?
37. Challenges
• Errors in the corrected OCR
• Challenges in parsing citations
• The 80/20 rule: manually making connections that cannot be made by automated means
• Finding suitable sources of data to link to (DBpedia? VIAF? EOL? Others?)
38. Summary
• This data may already exist online.
• It may also not always be as accurate as needed for science.
• We are in a position to be the authoritative source for this information.
• Linked Data allows it to be easily reused and shared.
41. Thank You!
Unlocking Taxonomic Literature II using Linked Open Data
Joel Richard
richardjm@si.edu
library.si.edu/staff/joel-richard
Special thanks to:
The International Association for Plant Taxonomy, for giving us permission to scan and digitize TL-2 and place it online.
For his advice and support, Dr. Laurence Dorr, Botanist and Curator, Department of Botany, Smithsonian National Museum of Natural History.
This project was partially funded by the Atherton Seidell Endowment Fund of the Smithsonian Institution.
Editor's Notes
This is a quick demonstration of how linked data has grown over the past five years. Back in 2007 we had only a handful of data sets, at least according to Richard Cyganiak's searching. Between 2009 and 2010 the number of items doubled. As of Sept 2011 there are 295 data sets listed. There are more today and more being added every day. Not all data sets are represented here, so this is only a sample of what's available. The actual graph could be four or five times larger by now. What's the point? This is all data that has the potential to enhance YOUR data. This is all linked data. This is all open data.
The basic unit of LOD is the "triple", made up of three elements: an identifier, a predicate, and another identifier or a value of some kind. Think of it as a sentence: subject-verb-object. The underlined blue text indicates an identifier that can be linked to on the web. The first part of the triple is always an identifier. The third part is sometimes an identifier, but should be if an identifier exists. When we repeat these connections, we start to create a web of networked data.
Looking back, we can see that Tim Berners-Lee has mapped out these four principles that make up the foundation of linked data, which also give it structure and make it easy to use.
Going back to our web of data, we can now represent the identifiers as actual URIs. The next question is: where do we get the predicates from? Why are they important?
There are numerous vocabularies of predicates that we can use when developing our linked open data. (Describe them more in detail, leading into the next slide)
Wow, look at all of them! Mondeca Labs has collected and classified all the vocabularies they can find. There are 350 vocabularies listed here.
Here is an example of some linked data in a reasonably human-readable form. We have some prefix definitions of the predicate vocabularies we are using. Then we have the identifier in green, and the predicates in blue. Values are in black with identifiers enclosed in greater-than and less-than signs.
What are the benefits of LOD?
Example of LOD in action. Google's knowledge graph knows that Darwin is a person and that Shrewsbury is a place, allowing it to offer different, more specialized results in your search. As LOD becomes available, your data may be used to enhance these results. Google is also able to help disambiguate common terms, such as "Lafayette" (the college, various U.S. cities, or the Marquis de Lafayette).
http://google.com/
Here are some more examples of places you can go for linked data. The Library of Congress has linked data services for their authorities and vocabularies. Schema.org is being used within webpages to improve their visibility and search results. The US Government is offering a lot of data, some of it as linked data. LinkedData.org is a place to go to learn about all things linked data. Finally, Stephen Dale, a knowledge management consultant, has a great presentation with examples of linked data in use.
Overall, TL-2 provides the most comprehensive biographical and bibliographical analysis for systematic botany literature published between 1753 and 1940 to date.
Here is a page from TL-2. It’s hard to read. Let’s zoom in a bit.
When we’ve zoomed in, we can see Darwin’s name, description, birth and death dates, and an abbreviation in parenthesis. We also have herbaria (libraries of plant samples) that he contributed to, and a brief note about his significance and how his works are greater than that which can be contained by TL-2.
Continuing our zooming… This includes some additional information that we know about Charles Darwin, including places where we can find known samples of his handwriting, species that were named for him and even postage stamps that honor him.
Continuing our zooming… Here we see three publications by Darwin, giving the number of the book, the title, and publication information.
The things that make TL-2 important are the unique abbreviations of the author names, e.g. “Darwin,” outlined in green. Also significant are the abbreviations of the publication titles, also outlined in green (“Origin sp.”), though not all publications have abbreviated titles. In red are the book numbers, also unique across all 37,000 publications. Finally, we have the “short title” of the volumes, which is outlined in blue.
Briefly, this was our process to create the data. In January 2011, we scanned the books and placed them online at the Internet Archive. Later, after selecting a contractor, we sent the scans and the OCR text (created at the Internet Archive) to the contractor, who ultimately created a 99.97% accurate text version of TL-2. They then parsed that data to a limited degree and delivered to us an XML dataset that we imported into a SQL Server database. Finally, we created a searchable, browsable website to access the TL-2 data, opening it up to researchers around the world. Two of them use it on a regular basis. (rimshot!) In reality, in a month we get about 500 visitors and 6,000 pageviews, with about 60% of those coming from outside of the U.S.
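The XML-to-database import step can be sketched in a few lines. This is a stand-in, not the actual schema: SQLite replaces SQL Server, and the element and column names are invented for illustration:

```python
# Sketch of the import step: parse a hypothetical fragment of delivered XML
# and load it into a relational table. SQLite stands in for SQL Server, and
# the element/attribute names are invented, not the contractor's real schema.
import sqlite3
import xml.etree.ElementTree as ET

xml_data = """
<tl2>
  <title number="1313" author="Darwin">
    <shortTitle>On the origin of species</shortTitle>
  </title>
</tl2>
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles (number INTEGER PRIMARY KEY, author TEXT, short_title TEXT)"
)

for el in ET.fromstring(xml_data).iter("title"):
    conn.execute(
        "INSERT INTO titles VALUES (?, ?, ?)",
        (int(el.get("number")), el.get("author"), el.findtext("shortTitle")),
    )

row = conn.execute("SELECT * FROM titles WHERE number = 1313").fetchone()
print(row)  # (1313, 'Darwin', 'On the origin of species')
```

The real pipeline would run one pass like this over all 37,000 titles and 9,900 authors, which is why the parse accuracy discussed next matters so much.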
This is our current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. This page got approximately 860 visitors and 1,500 visits in the month of April 2013, which is twice the number of visitors we got in April 2012. We actually get more visits from Europe than from North America. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
Earlier we mentioned 99.97% accuracy. This means that if we assume 38 million characters in all of TL-2, there are upwards of 12,000 errors in our text. (In reality this is more like 5,000-6,000 due to the nature of our data.) This may not be bad for the textual components of the content, but when it comes to parsing citations or more structured information, it will prove to be a challenge. Other datasets may not have this problem, but as we are scanning and converting to text, this is something that will always be present for us.
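The arithmetic behind that estimate is simple enough to check directly:

```python
# Back-of-the-envelope check of the error estimate quoted above:
# 38 million characters at 99.97% accuracy.
total_chars = 38_000_000
accuracy = 0.9997

errors = total_chars * (1 - accuracy)
print(int(round(errors)))  # 11400 -- i.e. "upwards of 12,000 errors"
```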
This is a page of TL-2 showing Charles Darwin and On the Origin of Species, with those items highlighted that are immediately visible and can be parsed and turned into Linked Data. There is other data on the page that could be turned into linked data, but at this time, we have only parsed the data that is highlighted here. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. One important thing to note here is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. Another important item is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
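Extracting those two identifiers is a typical pattern-matching job. A sketch with regular expressions; the sample strings are simplified stand-ins, not actual TL-2 entry text:

```python
# Sketch: pulling the two key identifiers out of TL-2 entry lines with
# regular expressions. The sample strings below are simplified for the
# example and do not reproduce real TL-2 formatting.
import re

author_line = "DARWIN, Charles Robert (1809-1882), British naturalist. (Darwin)."
title_line = "1313. On the origin of species by means of natural selection."

# The author abbreviation is the parenthesized, capitalized token at the end.
abbrev = re.search(r"\(([A-Z][A-Za-z. ]+)\)\.?$", author_line)

# The title number is the leading integer before the first period.
number = re.match(r"(\d+)\.\s+(.*)", title_line)

print(abbrev.group(1))  # Darwin
print(number.group(1))  # 1313
```

With 99.97% OCR accuracy, patterns like these will occasionally fail on garbled characters, which is why parsing structured fields is harder than recovering running text.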
As an example, Wikipedia has 3,000 botanists in their database. We have 10,000 of them. We have the more complete, richer set of data that can be used to