This document discusses how linking data and vocabularies according to Semantic Web principles can help taxonomy professionals manage controlled vocabularies in today's interconnected information world. It argues that treating vocabularies as interoperable networks, rather than isolated systems, allows them to be rapidly adopted, reused and mapped across organizations. The document outlines several capabilities needed for linked data vocabulary management, such as vocabulary mapping, slicing, version management and modeling/conversions. Vocabulary mapping is discussed in more detail as an example capability.
An overview of the benefits of using both taxonomies and metadata to make your information easier to search. Presentation by Alice Redmond-Neal of Access Innovations, Inc.
Seven Ways Concept-Based Auto Categorization Tames Big DataSteven Toole
Concept-based auto categorization, which automatically categorizes documents based on their actual content, not keywords or terms, is a fast, easy, and repeatable way to pinpoint only the most important documents and e-mails among libraries spanning millions of files and messages. It is an established standard in legal e-discovery and U.S. intelligence, having proved itself defensible and highly scalable.
By using sample documents containing the concepts being sought, concept-based auto categorization "looks" across an organization's entire electronic content and finds others like them. Because it doesn't depend on finding key words or terms, it is faster, easier, and far more accurate than lexicon-based taxonomy alternatives.
With concept-based auto categorization, then, it's no longer necessary to manually create – and constantly maintain – word-based taxonomies and complex rules in order to precisely and accurately classify large volumes of unstructured big data and improve the "findability" of information.
Benefits of Accurate Categorization
In enterprise content management, increased categorization accuracy enables better content lifecycle management, improved sharing among internal and external audiences, more effective document and records management for disposal, retention, and compliance, and reduced exposure to the cost of future legal matters.
Concept-based auto categorization is now in its early stages of adoption to help tame big data, reducing the burden while simultaneously capitalizing on its hidden value. It does this by helping organizations:
1. Dispose of redundant, outdated and trivial (ROT) documents and e-mails.
2. Maintain archiving regulatory compliance.
3. Improve cross-functional, divisional, and external content sharing and collaboration.
4. Improve content lifecycle management amidst evolving terms and categories.
5. Integrate – and dis-integrate – content through mergers, acquisitions, and divestitures.
6. Improve security, privacy, and risk mitigation.
7. Auto categorize in any language.
Despite the hype around big data, few will disagree that it poses challenges, as well as benefits if managed properly, and fewer still will disagree that it's going away anytime soon. Throwing more bodies at the problem is simply not practical, as the volume, velocity, and variety of content comprising big data are accelerating.
Concept-based auto categorization has proven itself as a highly effective, extremely fast, and incredibly precise approach; the possibilities are endless for applying it to big data to address its major obstacles and to harvesting its broad benefits.
Semantic Web Technologies: Changing Bibliographic Descriptions?Stuart Weibel
Keynote presentation at the North Atlantic Health Science Library meeting, October 26, 2009.
An introduction to semantic web technologies and their relationship to libraries and bibliographic data.
Stuart Weibel, Senior Research Scientist, OCLC Research
An overview of the benefits of using both taxonomies and metadata to make your information easier to search. Presentation by Alice Redmond-Neal of Access Innovations, Inc.
Seven Ways Concept-Based Auto Categorization Tames Big DataSteven Toole
Concept-based auto categorization, which automatically categorizes documents based on their actual content, not keywords or terms, is a fast, easy, and repeatable way to pinpoint only the most important documents and e-mails among libraries spanning millions of files and messages. It is an established standard in legal e-discovery and U.S. intelligence, having proved itself defensible and highly scalable.
By using sample documents containing the concepts being sought, concept-based auto categorization "looks" across an organization's entire electronic content and finds others like them. Because it doesn't depend on finding key words or terms, it is faster, easier, and far more accurate than lexicon-based taxonomy alternatives.
With concept-based auto categorization, then, it's no longer necessary to manually create – and constantly maintain – word-based taxonomies and complex rules in order to precisely and accurately classify large volumes of unstructured big data and improve the "findability" of information.
Benefits of Accurate Categorization
In enterprise content management, increased categorization accuracy enables better content lifecycle management, improved sharing among internal and external audiences, more effective document and records management for disposal, retention, and compliance, and reduced exposure to the cost of future legal matters.
Concept-based auto categorization is now in its early stages of adoption to help tame big data, reducing the burden while simultaneously capitalizing on its hidden value. It does this by helping organizations:
1. Dispose of redundant, outdated and trivial (ROT) documents and e-mails.
2. Maintain archiving regulatory compliance.
3. Improve cross-functional, divisional, and external content sharing and collaboration.
4. Improve content lifecycle management amidst evolving terms and categories.
5. Integrate – and dis-integrate – content through mergers, acquisitions, and divestitures.
6. Improve security, privacy, and risk mitigation.
7. Auto categorize in any language.
Despite the hype around big data, few will disagree that it poses challenges, as well as benefits if managed properly, and fewer still will disagree that it's going away anytime soon. Throwing more bodies at the problem is simply not practical, as the volume, velocity, and variety of content comprising big data are accelerating.
Concept-based auto categorization has proven itself as a highly effective, extremely fast, and incredibly precise approach; the possibilities are endless for applying it to big data to address its major obstacles and to harvesting its broad benefits.
Semantic Web Technologies: Changing Bibliographic Descriptions?Stuart Weibel
Keynote presentation at the North Atlantic Health Science Library meeting, October 26, 2009.
An introduction to semantic web technologies and their relationship to libraries and bibliographic data.
Stuart Weibel, Senior Research Scientist, OCLC Research
This presentation was provided by Tim McGeary of Duke University during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
Starfish is a network approach to knowledge sharing. Founded by the science faculty of the University of Amsterdam, Starfish is driven by educational innovators to support teachers and connect them with useful resources. Starfish sets out to become a community-driven platform where everybody can contribute. Whether for a specific community, such as a faculty or project group, or for the entire world to benefit.
The volume of information your company creates and stores digitally is growing every day. Every business system in your organization creates or uses some type of data, making the job of records management challenging. If you had to manage records based only on the contents of the record itself, it would be very difficult. Likewise, searching for information across the organization would also be challenging. This is where metadata plays an important role in records management.
Simple Knowledge Organization System (SKOS) in the Context of Semantic Web De...gardensofmeaning
Links are valuable. Links between documents, between people, between ideas, between data. Data is now a first class Web citizen, and the Web is expanding as more of these valuable networks are deployed within its fabric. Well-established knowledge organization systems like the Library of Congress Subject Headings will play a major role within these networks, as hubs, connecting people with information and providing a firm foundation for network growth as many new routes to the discovery of information emerge through the collective action of individuals. Or will they?
This talk introduces the Simple Knowledge Organization System (SKOS), a soon-to-be-completed W3C standard for publishing thesauri, classification schemes and subject headings as linked data in the Web. This talk also presents SKOS in the context of the W3C’s Semantic Web Activity, and in particular the work of the W3C’s Semantic Web Deployment Working Group where other specifications are being developed for publishing linked data in the Web, for embedding linked data in Web pages, and for managing Semantic Web vocabularies. Finally, this talk takes a mildly inquisitive look at the value propositions for linked data in the Web, and how LCSH might be deployed in the Web for better information discovery.
From Bibliometrics to Cybermetrics - a book chapter by Nicola de BellisXanat V. Meza
Disclaimer: All original texts and images belong to their rightful owners.
Chapter 8 of the book "Bibliometrics and Citation Analysis" by Nicola de Bellis.
Content Management, Metadata and Semantic WebAmit Sheth
Keynote given at NetObjectDays conference, Erfurt, September 11, 2001.
One of the earliest keynotes discussing commercial semantic web technologies, semantic web applications (including semantic search, semantic targeting, semantic content management). Prof. Sheth started a Semantic Web company Taalee, Inc. in 1999 (Product was MediaAnywhere A/V search engine),that merged to become Voquette in 2001 (product was called SCORE), Semagix in 2004 (product was called Semagix Freedom), and then Fortent in 2006 (products included Know Your Customers). Additional details can be found in U.S. Patent #6311194, 30 Oct. 2001 (filed 2000).
Note: the commercial system used "WorldModel" as at the time, business customers were not yet warm to "Ontology" - the concept/intent is the same. More recent information at http://knoesis.org
This presentation was provided by Tim McGeary of Duke University during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
Starfish is a network approach to knowledge sharing. Founded by the science faculty of the University of Amsterdam, Starfish is driven by educational innovators to support teachers and connect them with useful resources. Starfish sets out to become a community-driven platform where everybody can contribute. Whether for a specific community, such as a faculty or project group, or for the entire world to benefit.
The volume of information your company creates and stores digitally is growing every day. Every business system in your organization creates or uses some type of data, making the job of records management challenging. If you had to manage records based only on the contents of the record itself, it would be very difficult. Likewise, searching for information across the organization would also be challenging. This is where metadata plays an important role in records management.
Simple Knowledge Organization System (SKOS) in the Context of Semantic Web De...gardensofmeaning
Links are valuable. Links between documents, between people, between ideas, between data. Data is now a first class Web citizen, and the Web is expanding as more of these valuable networks are deployed within its fabric. Well-established knowledge organization systems like the Library of Congress Subject Headings will play a major role within these networks, as hubs, connecting people with information and providing a firm foundation for network growth as many new routes to the discovery of information emerge through the collective action of individuals. Or will they?
This talk introduces the Simple Knowledge Organization System (SKOS), a soon-to-be-completed W3C standard for publishing thesauri, classification schemes and subject headings as linked data in the Web. This talk also presents SKOS in the context of the W3C’s Semantic Web Activity, and in particular the work of the W3C’s Semantic Web Deployment Working Group where other specifications are being developed for publishing linked data in the Web, for embedding linked data in Web pages, and for managing Semantic Web vocabularies. Finally, this talk takes a mildly inquisitive look at the value propositions for linked data in the Web, and how LCSH might be deployed in the Web for better information discovery.
From Bibliometrics to Cybermetrics - a book chapter by Nicola de BellisXanat V. Meza
Disclaimer: All original texts and images belong to their rightful owners.
Chapter 8 of the book "Bibliometrics and Citation Analysis" by Nicola de Bellis.
Content Management, Metadata and Semantic WebAmit Sheth
Keynote given at NetObjectDays conference, Erfurt, September 11, 2001.
One of the earliest keynotes discussing commercial semantic web technologies, semantic web applications (including semantic search, semantic targeting, semantic content management). Prof. Sheth started a Semantic Web company Taalee, Inc. in 1999 (Product was MediaAnywhere A/V search engine),that merged to become Voquette in 2001 (product was called SCORE), Semagix in 2004 (product was called Semagix Freedom), and then Fortent in 2006 (products included Know Your Customers). Additional details can be found in U.S. Patent #6311194, 30 Oct. 2001 (filed 2000).
Note: the commercial system used "WorldModel" as at the time, business customers were not yet warm to "Ontology" - the concept/intent is the same. More recent information at http://knoesis.org
Vocabulary Interoperability using Linked Data : Principles, Capabilities and ...James R. Morris
Managing unconnected silos of information, including taxonomies, will not harness the power of a networked, collaborative world. Vocabulary interoperability, enabled by the Linked Data principles of the Semantic Web will.
For version with notes, please contact me.
eLearning: Ipotesi progettuale di una rete di learning pointDomenico Capano
Aggiornamento di alcuni segmenti della Formazione Professionale della Provincia di Viterbo attraverso implementazione della FaD con metodiche
E-Learning e Blended Learning based
1. Analisi dei Fabbisogni del Settore della Formazione Professionale nella FaD
2. Ipotesi progettuali
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
Heather Hedden, Senior Consultant at Enterprise Knowledge, presented “The Role of Taxonomy and Ontology in Semantic Layers” at a webinar hosted by Progress Semaphore on April 16, 2024.
Taxonomies at their core enable effective tagging and retrieval of content, and combined with ontologies they extend to the management and understanding of related data. There are even greater benefits of taxonomies and ontologies to enhance your enterprise information architecture when applying them to a semantic layer. A survey by DBP-Institute found that enterprises using a semantic layer see their business outcomes improve by four times, while reducing their data and analytics costs. Extending taxonomies to a semantic layer can be a game-changing solution, allowing you to connect information silos, alleviate knowledge gaps, and derive new insights.
Hedden, who specializes in taxonomy design and implementation, presented how the value of taxonomies shouldn’t reside in silos but be integrated with ontologies into a semantic layer.
Learn about:
- The essence and purpose of taxonomies and ontologies in information and knowledge management;
- Advantages of semantic layers leveraging organizational taxonomies; and
- Components and approaches to creating a semantic layer, including the integration of taxonomies and ontologies
The whitepaper addresses the challenges in the data–driven organizations, medical research and health care. It summarizes how the context-enabled and semantic enrichment can transform the traditional method to search optimum data. 3RDi has advanced content enrichment with Named Entity Recognition, Semantic similarity, Content classification and Content summarization. Get the right data at the right time that helps medical researchers and health care practitioners.
This presentation, part of our Redefining the IT Lexicon (applied innovation series), introduces a new term related to the field of Semantic technology. That term is Semantic Intelligence. This briefing was provided by Semantech's InnovationWorx division.
Looking Under the Hood -- Australia SharePoint ConferenceChristian Buckley
"Looking Under the Hood: How Your Metadata Strategy Impacts Everything You Do" was presented on 3/9/2011 at the Australia SharePoint Conference in Sydney.
This presentation (part of Semantech InnovationWorx's Redefining IT series) explores what the next generation of Content Management and Search Engines will look like and what we need to do to reach intelligent computing...
Taxonomy is a useful way to categorize your content so you can find it, re-use it, and help users browse and search through your content. It also helps with search engine optimization.
Similar to Vocabulary interoperability in the semantic web james r morris (20)
Vocabulary interoperability in the semantic web james r morris
1. Vocabulary Interoperability in the Semantic Web
Why Linked Data will Transform the Taxonomy Profession
James R Morris
Associate Principle Information Scientist
AstraZeneca
Oct 21, 2013
This article was originally published in Taxonomy Times : Bulletin of the SLA Taxonomy Division, issue 16
(October 2013). Visit http://taxonomy.sla.org to learn more about becoming a member.
Introduction: why this matters
The management of controlled vocabularies is a traditional librarian specialty. These vocabularies, these
networks of concepts, that librarians have created, managed, and used for decades, are more important than
ever in today’s highly networked and complex information world. (This article uses the term "vocabularies"
to refer to any type of controlled and structured collection of terms or concepts: thesauri, taxonomies,
ontologies, glossaries, etc.)
We live in a networked world--a “linked” world. The World Wide Web is more than a technology; it’s a fact of
life. No one entity controls the content on the web, and that’s a good thing. In the same way, we can’t control
all the information and data that we might need to do our jobs. We can’t import it all, we can’t recreate it all,
and we can't buy it all. More critically, we can't predict what information will help us answer our questions
until we ask.
Vocabularies (taxonomies, thesauri, etc.), as every taxonomist knows, are essential for finding, navigating,
and connecting the information or data we need. But just like the rest of the information we need, we don’t
control those either (unless we own them). We can’t import them all and we can’t buy them all. And we can’t
recreate them all--which is often what we end up doing.
To continue to build our profession's capability in building, managing, and using taxonomies, thesauri and
other vocabularies, we need an approach that recognizes the reality of today’s highly networked information
world. This approach includes Linked Data principles and the technology of the Semantic Web, which are
designed to harness the inherent open-endedness of the World Wide Web. It's the right platform on which to
continue building our professional expertise in vocabulary management.
The opportunities that this field presents to our profession are very exciting. We need more discussions about
Linked Data vocabulary management among librarians and information scientists. To an equal extent, the
values, skills, and experiences of librarians and information scientists are needed to make this emerging field
successful.
Background: The “hyper-connected enterprise”
2. The phrase “hyper-connected enterprise” was highlighted in a 2007 Gartner report, “Gartner Predicts 2008:
CIO Success Factors” (http://www.gartner.com/id=564014. License required). Many organizations are
moving towards that model and the progression usually looks like this:
1. One Enterprise. It starts with a very self-contained organization, with ready access to all the information
it needs.
2. Connected Enterprise. Then increasing number of suppliers, outsourced providers, and new partners
become part of how the organization operates day-to-day.
3. Hyper-connected Enterprise. Then, the relationship to these providers and partners begins to evolve.
Positions of the players shift, they come and go more rapidly, and boundaries between the center and
periphery begin to blur.
To use Gartner's language, "Ecosystems widen, Complexity rises, Coordination increases.” The bottom line is
that coordination is the key--not control. Of course your own information must be well controlled, but you
cannot succeed with that alone.
As mentioned earlier, this is about more than a hyper-connected enterprise--it’s about a hyper-connected
world. It’s an open webbed world; we can’t predict what information we’ll need to answer--or even ask--our
questions. But in this world how do we effectively combine resources and work with others to ultimately
achieve our goals?
Vocabulary Interoperability and the Semantic Web
The Semantic Web is about building a data architecture that enables us to ask questions and solve business
problems across a heterogeneous information landscape extending beyond the traditional boundaries of one
organization. Traditionally, mainly professional indexers and searchers have used controlled vocabularies.
But today they are often used in a text-analytics or informatics context:
Adding structure to unstructured text; essentially tagging unstructured content with vocabulary term
identifiers, by looking for strings that match a wide-array of synonyms, and by other more sophisticated
algorithms.
Adding additional structure to already indexed content so that we can organize information in ways
specific to our requirements.
Adding vocabulary metadata to the content or to a search index to connect information across
repositories of information by connecting the concepts embedded in that information.
Let’s review the hyper-connected enterprise again, this time in the context of controlled vocabularies:
1. One Enterprise. For one organization, with good control of their information, an enterprise taxonomy,
or set of enterprise vocabularies can be created, and their usage enforced. A governance structure for
3. maintaining it can be established. Enterprise taxonomies and master data management is still
important--companies do need to manage their own information well.
2. Connected Enterprise: To use external information sources, use their native vocabularies. Deliver
information to outsourced providers and/or regulatory agencies requires the use of certain
vocabularies. For example, in the pharmaceutical and healthcare field, MedDRA, ICD-9-CM, ICD-10,
SNOMED are a few common ones. In the case of a merger or strategic alliance, another company
might have their own enterprise taxonomy or product thesaurus that can be integrated with yours
without too high a cost.
3. Hyper-connected enterprise: Other partners, universities, companies, divisions, and subsidiaries
connected to the enterprise are spread across the globe, and they come and go with increasing
frequency. Embarking on a new taxonomy integration project for every partner that appears in the
network is not feasible.
A small biotech might say, “Enterprise vocabulary? We’re too small – we just use terms that we
need; it’s not complicated. You're too bureaucratic.”
A large corporate partner or regulatory agency might say, “Over here, yes we follow standard
vocabularies. They’re not the same as your standards, but…”
An information supplier or database provider could say, “Our proprietary databases use specially
designed vocabs optimized for our software. You should really be using our software, you know.
We have the information need.”
Traditional solutions to taxonomy development and management fall short of meeting the challenges of the
hyper-connected enterprise. Semantic Web principles and technology offer a new approach to the
management of our thesauri, taxonomies, and ontologies--one of vocabulary interoperability. Treating
vocabularies as part of a network--just like the web itself—is needed. A network of controlled vocabularies to
facilitate data interoperability and query expansion across disparate structured and unstructured, internal
and external, information is part of the solution. A “network” implies that the vocabularies can be rapidly
adopted, reused, mapped, and versioned using standardized, sustainable technologies and approaches.
Principles
The following principles were drafted in my organization. They are presented as points for discussion within
the librarian/taxonomist community.
Don’t create a new vocabulary where one already exists. The preference is for authoritative,
trustworthy, well-managed sources. Apply standard library criteria used to evaluate resources:
authority, accuracy, currency, point of view (objectivity), coverage, relevance, and format.
Extend those sources if necessary, but even then they should be extended by making associations
with other authoritative sources. Whenever possible leverage cross-vocabulary mappings already
available publicly.
4. When necessary, extend or map an external vocabulary with internal or bespoke terminology. This
could include mapping obvious synonyms between vocabularies, or could entail more nuanced
relationships. For example, mapping a published vocabulary to a list of values from a proprietary or
internal database. By strategically connecting the concepts, the data is connected.
Do not corrupt authoritative vocabulary sources. It will always be possible to identify the source
vocabulary, including what version, in the network. It's just like a library. A page is not taken out of a
book and pasted it into another; sections are not crossed out here and added there to the point
where the books can’t be put back together again--that shouldn’t be done with vocabularies either.
However, using Linked Data standards, vocabularies can be repurposed. For example, perhaps a
deeply hierarchical taxonomy just needs to be rendered as flat list, or only some concepts are
relevant to the purpose at hand. Linked Data can enable these types of reuse without corrupting the
source.
Obtain vocabularies directly from the authority that produces them whenever practical. Vocabularies
will be managed as linked data. Preference is for the “SKOS” standard. SKOS stands for "Simple
Knowledge Organization System" and is a lightweight W3C data standard that can be applied to any
controlled vocabulary. Focusing on this simple standard allows us to standardize curation,
management, and delivery processes.
Capabilities
The following are a proposed set of linked data vocabulary management capabilities. They are presented as a
impetus for further discussion within the librarian/taxonomies community.
Vocabulary Mapping. Reusing, creating, and maintaining the relationships between vocabularies
holds the power of the “vocabulary network” at the heart of Linked Data Vocabulary Management.
Processes and technology need to be developed for lightweight, sustainable, semantics-based
vocabulary mapping. Challenges include:
o Mapping two or more vocabularies together using Semantic technology.
o Facilitating the review of automatic mapping; approving, rejecting, adjusting, and amending
mapping.
o Applying traditional taxonomy heuristics in a Linked Data environment as quality measures.
o Leveraging the mapping already done in published vocabularies.
o Publishing mapped “bridging” vocabularies to customers.
Vocabulary Slicing. Sometimes a vocabulary consumer (system or project) needs only a subset of a
vocabulary that is relevant to their needs; for example, only the terms that are of interest to a
particular project or organizational unit. Rather than create a duplicate, we need to identify terms or
branches of relevance, and make those available--all while retaining their link back to the authority,
5. so that updates to the authority can continue to be leveraged (new synonyms, new mappings, new
child terms, new notes).
Vocabulary Version Management. As the focus is the reuse of already published vocabularies,
successfully managing the impact of new versions on the network is critical. For example:
o Comparing a new version to an existing version.
o Identifying impact of vocabulary changes on the network; that is, to other vocabularies that
map to it.
o Identifying impact of vocabulary changes on consuming systems or related datasets. Then,
once we identify the impact, how do we address it?
Vocabulary modeling and conversions. The vocabularies in the network need to be accessible as
Linked Data, using the data standards of the Semantic Web. And, until external suppliers deliver or
make their vocabularies available as linked data, we will need to be extremely proficient in
converting them from their native formats.
Vocabulary Inventory. Managing a network of vocabularies requires a sophisticated inventory. Ideally
managed as Linked Data itself, the inventory can directly inform queries using the vocabulary
network. For example, identify all the vocabularies that cover concept X, the datasets that reference
those vocabularies, and how to access them. There are developing standards in the Semantic Web
specifically for managing these types of inventories.
Vocabulary Mapping using Linked Data--in more detail.
For this article, just one of these capabilities, Vocabulary Mapping, will be explored in more detail. How, and
why, would two vocabularies be brought together using Linked Data? First, a few Semantic Web
fundamentals need to be covered.
The Semantic Web is a web of data--not pages.
Pages need to be read by a person, or a program mimicking a person, in order to obtain the meaning of the
data on the page. For example, if we want to learn the times that local bars offer “happy hour”, we need to find
the web page for each bar, scan pages, follow links, and finally read content that says something like “we have
happy hour M-F from 4-7”. It doesn’t matter that that information might originally be in a database that builds
the page, because to interpret the information we need to read the page. Linked Data challenges us to use the
principles of the web: unique identification of resources and links between those resources, to the data itself.
URIs, RDF, and Ontologies
URIs. URI stands for “uniform resource identifier”. URL’s are “uniform resource locators”, a type of
URI used to uniquely identify pages and locate them. You "point" your browser to them to locate the
6. unique page. In the semantic web, things in the real world are also given URIs. In fact they use the
same syntax. For, example, a local bar, The Artful Dodger, might have a webpage with a URL. But in
the semantic web the bar itself--as a unique thing, not a webpage--would have a URI. The URI could
be used to differentiate it from the Dickens character of the same name, which might also have a URI.
RDF. RDF stands for Resource Description Framework. RDF is used to encode the most basic
statement about a thing in the form of what is called a "triple": subject--predicate--object, which is
often abbreviated S--P--O. For example:
o The Artful Dodger (subject), Is a (predicate), Bar (object).
o The Artful Dodger (subject), Has a (predicate), Happy Hour (object).
Ontologies. Ontologies are collections of information about classes of things and their relationships.
For example, an ontology might state that a Bar (subject) is a Type of (predicate) Restaurant (object).
So if we have data that states that the Artful Dodger (S), Is a (P), Bar (O), then, by referencing our
ontology, we can infer that the Artful Dodger (S) is a (P), restaurant (O). Inference is a very important
concept in the Semantic Web that will be covered more closely in the next example.
The following is how these fundamentals apply to vocabularies, and specifically to the capability of mapping
two vocabularies together.
Concepts are Things
Just like The Artful Dodger is a thing, the concept of a “Bar” is also a thing. Vocabularies, taxonomies, thesauri,
etc. are all collections of concepts.
Concepts have URIs
As discussed above, things are identified in the semantic web with URIs. A vocabulary is about concepts, their
properties, and their relationships.
SKOS is for vocabularies
SKOS stands for “Simple Knowledge Organization System”, and is a very popular Semantic Web ontology. It is
a standardized set of classes and properties for representing “knowledge organization systems”, a.k.a.
vocabularies. Unlike other thesaurus and taxonomy standards, the fact that SKOS is Linked Data means it
allows for the management of vocabularies in a distributed, linkable, interoperable, way. In other words, as
part of the open-ended, web of data. The following are some of the most common SKOS elements:
skos:Concept and skos:ConceptScheme. These are what are called "classes" in the Semantic Web. Every
"term" in a vocabulary is of the class skos:Concept. A skos:ConceptScheme designates a particular
thesaurus, or part of a thesaurus. Concepts and ConceptSchemes are both things, so each is
represented by a unique URI. Concepts can belong to one or more ConceptSchemes.
skos:prefLabel, skos:altLabel, and skos:definition. These are used to designate fundamental properties
of the classes like skos:Concept. Skos:prefLabel is preferred name for a concept, and skos:altLabel is
any synonym or other type of entry term, e.g. “see from”, “used for”, etc. An important feature of
SKOS as well as most Semantic Web ontologies is that the classes and properties are extensible. For
7. example, we can create a sub-property, unique to our context. We might want a specific type of
skos:altLabel to differentiate acronyms from other synonyms. In this case, we would create a new
property for acronym, but make it a sub-property of skos:altLabel. In this way, we can identify the
acronyms specifically if we need to, but they can also be referenced as generic synonyms. This is
particularly useful when working across multiple vocabularies where custom properties can all
resolve to generic SKOS standards.
skos:broader, skos:narrower, skos:related. These are the essential properties that serve as the building
blocks for a hierarchical or ontological structure, within a single vocabulary.
skos:exactMatch, skos:closeMatch. All of the above properties are intended for use within a particular
vocabulary or skos:ConceptScheme. These matching relationships, on the other hand, are used to
relate Concepts across vocabularies. For example, the concept for Bladder Cancer in one vocabulary
might be an exactMatch to the concept for Bladder Neoplasm in another vocabulary.
Now that some basics about Linked Data and SKOS have been covered, we’ll look at a specific example of
using SKOS to delivering the core capability of Mapping. Here are two concepts from two popular biomedical
thesauri: the National Cancer Institute (NCI) Thesaurus or NCIt, and the National Library of Medicine’s
Medical Subject Headings or MeSH, encoded in SKOS:
NCI Thesaurus – “Bladder Neoplasm”
URI = http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C2901
(Shorthand: ncit:C2901)
Subject (S) Predicate (P) Object (O)
ncit:C2901 a skos:Concept
skos:prefLabel “Bladder Neoplasm”
skos:altLabel “Tumor of Bladder”
skos:definition “A benign or malignant…”
skos:broader ncit:C2900 (“Bladder Disorder”)
skos:broader ncit:C3431 (“Urinary System Neoplasm”)
MeSH – “Urinary Bladder Neoplasms”
URI = http://purl.bioontology.org/ontology/MSH/D001749
(Shorthand: mesh:D001749)
Subject (S) Predicate (P) Object (O)
mesh:D001749 a skos:Concept
8. skos:prefLabel “Urinary Bladder Neoplasms”
skos:altLabel “Cancer of Bladder”
skos:definition “Tumors or cancer of the URINARY
BLADDER…”
skos:broader mesh:D014571 (“Urogenital Neoplasms”)
skos:broader mesh:D001745 (“Urinary Bladder Diseases”)
Once we can uniquely identify each of these concepts with URIs we can record some basic things about them:
They're both concepts.
They both have preferred labels
They each have synonyms or entry terms (many more, in fact, than just those listed here)
They have definitions.
They have one or more parents because both of these thesauri are poly-hierarchical.
A couple critical points here: If we look at the full record for this concept in MeSH
(http://www.ncbi.nlm.nih.gov/mesh/68001749), it does in fact have the NCIt preferred label, “Bladder
Neoplasm”, as a synonym. However, NCIt does not have the MeSH preferred name as a synonym. Note
especially that alternate labels of “Cancer of the Bladder” in MeSH and “Tumor of the Bladder” in NCIt are
unique to each vocabulary.
But what if we know that mesh:D001749 is the same concept as ncit:C2901? We map those two concepts
together by creating a new “triple” that creates a relationship between the two URIs.
Subject (S) Predicate (P) Object (O)
ncit:2901 skos:exactMatch mesh:D001749
So now we've encoded the relationship that MeSH concept labeled “Urinary Bladder Neoplasms” equals the
NCIt concept labeled “Bladder Neoplasm”. How do we realize the power of this relationship? What if we’re
searching two carefully indexed repositories--one indexed using MeSH and another using NCIt and we want
to produce a joined up view of these two repositories, leveraging the power of the two separate indexing
methods? What if we’re searching across multiple repositories and we want as rich a synonym set for
“Bladder Cancer” as we can get. As we see in this example--by using only one of these vocabularies, we could
end up missing things.
9. Here’s how we use Linked Data to do really interesting things. Now we’ll create a rule that builds upon our
new matching relationship to bring more intelligence to our vocabularies. For this we use SPARQL. SPARQL
is akin to SQL for relational databases in that it is the language with which to query and manipulate Linked
Data in the Semantic Web.
SPARQL Rule:
CONSTRUCT ?s skos:altLabel ?o .
WHERE ?s skos:exactMatch ?match .
(Note that this is not executable SPARQL; the syntax has been simplified for the clarity of this example.)
Creating these rules is key. When we hear experts talk about making your data “more intelligent”--this is what
they mean! This three-line SPARQL command takes that new matching relationship and reasons that the
preferred labels or synonyms in the matched MeSH concept can be inferred as additional synonyms (in red
italics below) in the NCIt concept. It would do this for all the concepts in the NCIt that have been matched to
Mesh terms.
Subject(S) Predicate (P) Object (O)
ncit:2901 skos:prefLabel “Bladder Neoplasm”
skos:altLabel “Urinary Bladder Neoplasms”
skos:altLabel “Cancer of the Bladder”
Those terms “reasoning” and “inferred” are very important. First we’re making a rule that a computer can use
to infer additional information. Second--that’s all this has to be--an inference. We are not necessarily hard-coding
additional vocabulary data. We are not manipulating the source vocabularies at all. Every inferred
triple can be managed completely separately from either source vocabulary!
A fair question to ask is, "What about meta-thesauri and large vocabulary mapping efforts? Aren't they
intended to do the same thing? For example, isn’t that what the National Cancer Institute's Metathesaurus and
the National Library of Medicine's UMLS (Unified Medical Language System) initiatives are for?” Yes, to a
point. However:
1. MeSH and NCIt are just two familiar examples. What if we try to map vocabularies or search across
repositories that use less standard vocabularies: proprietary, internal, or custom vendor vocabs? Or
if we need to search, and view information in an interface using the richness of NCIt, but against an
repository that uses something different--a folksonomy perhaps, or just values selected from a drop-down
list.
?match skos:prefLabel | skos:altLabel ?o .
10. 2. What if the mapping we wanted to do was proprietary to a project or a company? Every taxonomist
will agree that how concepts are arranged and mapped to each other has power. However, what if
that mapping was relevant today--but not tomorrow? How can this be accomplished by leveraging
these large published vocabularies without corrupting them, or worse--by building yet another one?
It is important to be clear that identifying those initial matches (and they are not always as straightforward as
an “exactMatch”) is the hard intellectual work of vocabulary management that is, and will continue to be, a
hallmark of our profession. The semantic web doesn’t solve the problem of knowing which terms to match.
But what it does do--unlike any other taxonomy, thesaurus, vocabulary management system--is give us the
standards and tools to encode those matches and create rules that computers can use to make additional
inferences. And very importantly, it does not require that we merge or integrate or otherwise corrupt original
authoritative sources. So this approach doesn’t make it easy to manage vocabularies in a hyper-connected
world--that’s not the point. The point is it makes it possible; possible, sustainable and extensible.
Linked Data Vocabulary Management Roles
In thinking about how to make this all work and to develop and deliver vocabulary services based on these
new approaches, the following roles could be assigned.
Vocabulary Developer. Uses Semantic Web technologies and informatics toolkit to ingest, model, bridge,
slice, quality check, and version-control vocabularies. This would require some programming skills, and in-depth
knowledge of the technology.
Vocabulary Manager. Provides hands-on management, development and enhancement of vocabularies.
They would have a deep understanding of vocabulary processes; and internal/external vocabulary landscape
and best practices, often with subject specialty. This role is most like our taxonomist role of today.
Vocabulary Administrator. Provides first-line support for helping customer’s access, use and exploitation of
available vocabularies. They are accountable for management of service processes, documentation, and
inventories.
Conclusion: The Semantic Web needs us
The Promise and the Reality. Linked Data approaches are the sustainable way to work with vocabularies in
an increasingly federated and linked information world. The reality is that not enough vocabularies are
available, from the authority that produced them, natively in SKOS or another form of RDF. This needs to
change. But we can start to instill these practices where we work. This is field is rapidly developing; it needs
librarians and information scientists with their deep knowledge of the importance and power of controlled
vocabularies and the information sources that use them.
This is not the future--it’s now. Leading organizations in the US, like OCLC and the Library of Congress, are
adopting these standards. This even more true in Europe. The field of biomedical ontologies is very exciting
right now, and firmly moving toward a Linked Data paradigm. The nature of business and of information
demands new approaches. Even taking the Semantic Web’s promise of universally sharing data across the
world out of the picture, SKOS and RDF--even just applied to vocabularies--is opportunity alone. These
11. vocabularies can be published in whatever form people or systems need them in. However, to manage a web
of vocabularies requires tools and techniques designed for the web.
This is not easy. Breaking down information, including vocabularies into their most basic constructs, is what
RDF does. It is elegant in design, but can quickly become very complex in execution. Writing the SPARQL
queries that do complete transformations of vocabularies, manage versions, etc. can become very
sophisticated. But isn't it a challenge worth tackling? Aren’t librarians and information scientists poised to
make this happen?
SKOS is a great example of a simplified model with practical application. RDF can be used to model ontologies
in linguistically accurate, but extraordinarily complex, ways as well. The discussions at a Bioontology or
Semantic Web conference can get very deep. But the beauty of RDF is that those complex models can be
transformed, again without corrupting the source, into a simple model like SKOS. Guarding against
unnecessary complexity is something that librarians are especially good at.
Managing unconnected silos of information, including taxonomies, will not harness the power of a networked,
collaborative world. Vocabulary interoperability, enabled by the Linked Data principles of the Semantic Web
will--if further developed as a discipline within our profession.
Acknowledgements
This article was derived from a presentation given by James R Morris at the Special Libraries Association’s
2013 PHT and BIO Division Spring Meeting, April 14-16, 2013.
The author thanks:
The AstraZeneca R&D Vocabulary Team and Linked Data Community of Practice members:
Courtland Yockey, Sorana Popa, Rob Hernandez, Dana Crowley, Tom Plasterer, Mike Westaway,
Kerstin Forsberg, Johan Turnqvist, Tracy Wolski, Dan Ringenbach, Rajan Desai, Pete Edwards, Simon
Rakov, Jeff Saxe, and Ian Dix.
Vendors and Advisors:
o Scott Henninger, Bob Ducharme – TopQuadrant
o Dean Allemang – Working Ontologist
References
• Dean Allemang, Jim Hendler. Semantic Web for the Working Ontologist, 2nd edition. Morgan
Kaufman, 2011.
12. The first few chapters of this book provide one of the best overviews of the semantic web that you’ll find,
and is very readable. It also has a whole chapter on SKOS. Available through ScienceDirect and Safari.
• Bob Ducharme. Learning Sparql, 2nd edition. O’Reilly Media, 2013.
Invaluable to have on hand when writing SPARQL.
• World Wide Web Consortium (W3C). SKOS Simple Knowledge Organisation System Primer.
http://www.w3.org/TR/skos-primer Straight from the source, it is technically oriented, but once you
understand the basics, this is essential reading.
• National Center for Biomedical Ontology. BioPortal. http://bioportal.bioontology.org .
A phenomenally deep resource providing access to many biomedical vocabularies, queriable through
SPARQL. The NCBO also has several webinars available.
• Lee Harland, et al. “Empowering industrial research with shared biomedical vocabularies” Drug
Discovery Today, Volume 16, Issues 21-22, November 2011, Pages 940-947, ISSN 1359-6446,
10.1016/j.drudis.2011.09.013
Excellent overview of the challenges and opportunities for using biomedical vocabularies in
pharmaceutical research.
• Kerstin Forsberg. Linked Data and URI:s for Enterprises (Blog). http://kerfors.blogspot.com.
A thought-leader and evangelist for linked data, especially in the clinical domain.