1. The Dynamic Potential of
Semantic Enrichment
or, Everything You Always Wanted to
Know About Semantic Enrichment
OK, not everything.
Not even most things.
Just some things you probably should be aware of.
Allen Press
Emerging Trends in Scholarly Publishing™ Seminar
14 April 2011
Pam Harley
VP, Product & Market Development
Semedica™, a division of Silverchair
pamh@semedica.com
(434) 296-6333 x372
2. Why me?
Me
20+ years in STM publishing, many hats worn
print, digital
books, journals, news, continuing education…
editorial, production, product development
Silverchair
10+ years working with STM publishers to build products
and features from semantically tagged content
3. Here’s the plan
WHAT is semantic enrichment
WHY you should care (benefits)
HOW to get started
(with a few side trips to make sure we’re all on the
same page re: lingo)
4. First…
DON’T
do what I’m about to do
Don’t start by exploring technology
(Hint: Start with user stories)
5. What’s a user story?
A user story captures who wants a piece of
functionality, what that user wants to achieve,
and why it lets the user do something useful
6. Creating user stories
Focus your tagging strategy on user stories—how
people want to use your content:
What tasks are they trying to do when they use your
product? What answers are they looking for? At what point
in their workflow is your product used?
Almost all information sites have multiple user
stories. Know them for your products
Remember that your organization is also a key
user of your product
7. WHAT
is semantic…
enrichment
tagging
markup
indexing
fingerprinting
classification
categorization
?
8. Semantics are about
meaning
The meaning of content is currently written for
human understanding, not computers
Semantics adds a layer of meaning to your
content, so that computers can make sense of it
and build connections to it
Semantic metadata answers the most important
question of all for content producers and users:
What is this content about?
captured in a way that computers can process
9. “Atomizing” information
A semantic approach requires you to go beyond
documents and think of your content as data
Semantic markup allows knowledge in your
publications to be acted on as distinct bits of data
For example:
1 practice guideline = 1 document
OR
1 practice guideline = 312 distinct pieces of data
10. Taxonomy is the
semantic foundation
Taxonomy is the framework for the semantic layer
and semantic tagging
It allows…
Normalization
Consistency in tagging
Concept grouping and hierarchical relationships
Integrations/interoperability (internal and external)
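As a rough illustration of concept grouping and hierarchical relationships, here is a minimal Python sketch of a taxonomy held as a dictionary of concepts with parent links. The concept IDs, labels, and the `Concept` and `ancestors` names are invented for this example; a real taxonomy would come from your own content and your domain's standards.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    concept_id: str                      # hypothetical identifier
    preferred_label: str                 # the normalized name used in tags
    parent_id: Optional[str] = None      # link to the broader concept, if any
    synonyms: List[str] = field(default_factory=list)


def ancestors(taxonomy: Dict[str, Concept], concept_id: str) -> List[str]:
    """Return the labels of all broader concepts, nearest parent first."""
    chain = []
    current = taxonomy[concept_id].parent_id
    while current is not None:
        chain.append(taxonomy[current].preferred_label)
        current = taxonomy[current].parent_id
    return chain


# Illustrative three-level hierarchy (IDs are made up).
TAXONOMY = {
    "C1": Concept("C1", "infectious disease"),
    "C2": Concept("C2", "bacterial infection", parent_id="C1"),
    "C3": Concept("C3", "clostridium difficile infection",
                  parent_id="C2", synonyms=["c diff"]),
}

print(ancestors(TAXONOMY, "C3"))
```

Because every concept points at its parent, content tagged with a narrow concept can also be surfaced when a user browses or searches on any of its broader concepts.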
11. Equivalent relationships are
critical
Synonyms, abbreviations, jargon, misspellings, and
codes are critical components of a taxonomy
Necessary to normalize the natural and constantly
evolving variations in the language that authors use
to describe concepts and searchers use to find them
Vastly improve performance of autotagging systems
Precise strings are easier to match programmatically, and a
thesaurus magnifies the number of strings available to match to
a given concept
12. Normalization
Authors use different terminology to represent the
same topics
Examples:
Synonyms (newborn = neonate)
Abbreviations (GHB = gamma hydroxybutyrate)
Shorthand (c diff = clostridium difficile)
Searches for these language variations produce
different results
A semantic layer controlled by a taxonomy/
thesaurus normalizes these variations
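A minimal Python sketch of thesaurus-driven normalization, using the three examples above. The choice of which variant is the preferred term, and the `THESAURUS` and `normalize` names, are assumptions made for illustration only.

```python
# Preferred term -> known variants (synonyms, abbreviations, shorthand).
# Which variant counts as "preferred" is an assumption for this sketch.
THESAURUS = {
    "neonate": ["newborn", "neonate"],
    "gamma hydroxybutyrate": ["ghb", "gamma hydroxybutyrate"],
    "clostridium difficile": ["c diff", "clostridium difficile"],
}

# Invert the thesaurus once so each variant looks up its preferred term.
VARIANT_TO_PREFERRED = {
    variant: preferred
    for preferred, variants in THESAURUS.items()
    for variant in variants
}


def normalize(term: str) -> str:
    """Map an author's or searcher's wording to the preferred term.

    Lowercases, drops periods, and collapses whitespace before lookup.
    Unknown terms are returned in cleaned form, unchanged in meaning.
    """
    key = " ".join(term.lower().replace(".", " ").split())
    return VARIANT_TO_PREFERRED.get(key, key)


for raw in ["Newborn", "GHB", "C. diff", "asthma"]:
    print(raw, "->", normalize(raw))
```

If both tagging and search pass through the same lookup, a search for "newborn" and a search for "neonate" resolve to the same concept and therefore return the same results.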
17. Where does a taxonomy
come from?
Your content collection
Inputs from your users (e.g., author keywords,
search logs)
Subject matter expert consultation
Industry standard terminologies
Source for concepts, equivalents, guidance on hierarchy
18. The importance of industry
standard terminologies
Your taxonomy must be able to interact with
standards of your domain to forge meaningful
external integrations
Many terminologies are in use in different scientific
domains (UMLS, ACS, ACM, AIP, IEEE, OSA, EPA,
NASA, USGS…). Investigate what’s available
Great case example for domain-level taxonomy:
For medical content, UMLS metathesaurus maps together 100+
constituent health care vocabularies (MeSH, SNOMED, ICD,
RxNorm…) to support health care interoperability
19. Don’t reinvent the wheel!
If there’s a taxonomy available that’s a good fit, use it
BUT make sure you have a mechanism for adapting
it to meet the needs of
your content
your users
the pace of change/new concepts in your field
[Note to STM publishers in cutting-edge areas: You can’t
wait for the standards to catch up to your research output—
you’ll need to be able to add concepts at the time of
publication]
20. Ongoing taxonomy management
Taxonomies must be continually enhanced as
your domain evolves, your content set grows,
and your user needs and expectations change
Make sure it is easy to update your taxonomy and
make it available to your systems (tagging, web
applications), ideally in real time
Taxonomies should always be
considered a work in progress!
21. Application of taxonomy to
content—semantic tagging
Semantic tagging is the insertion of semantic
information at the level of XML elements
Example: <root-term termID="47521">t cells, regulatory</root-term>
Tagging can be embedded directly in XML,
provided as separate reference files, or placed
in database tables that reference elements
If the content itself cannot be tagged inline (e.g.,
images, videos, PDFs), tagging can be placed in
header files
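To make the embedded versus separate-reference options concrete, here is a short Python sketch that reads the `<root-term>` example above out of an XML fragment and also converts it to a stand-off record that could live in a reference file or database table. The surrounding `<para id="p1">` wrapper, the sentence text, and the function names are hypothetical; only the `root-term` element and its `termID` attribute come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph wrapping the tag shown on the slide.
SAMPLE = (
    '<para id="p1">Suppression is mediated by '
    '<root-term termID="47521">t cells, regulatory</root-term>'
    ' in this model.</para>'
)


def extract_tags(xml_text):
    """Return (termID, tagged text) pairs for embedded semantic tags."""
    root = ET.fromstring(xml_text)
    return [(el.get("termID"), (el.text or "").strip())
            for el in root.iter("root-term")]


def to_standoff(xml_text):
    """Return stand-off rows that reference the containing element by id."""
    root = ET.fromstring(xml_text)
    return [{"element_id": root.get("id"), "termID": el.get("termID")}
            for el in root.iter("root-term")]


print(extract_tags(SAMPLE))
print(to_standoff(SAMPLE))
```

The stand-off form carries the same information as the embedded form, which is why it can also serve content that cannot hold inline tags.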
22. Who/what tags?
Automated tagging—software analyzes content, adds tags
based on concept matching, patterns, grammar
Pros: Highly scalable, good at finding trends in large bodies of content. Sometimes the
only option for very large data sets
Cons: False positives, missed concepts
Manual tagging—humans with appropriate expertise
(sometimes called Subject Matter Experts, or SMEs) read the
content and apply tags
Pros: Precise, ideal when clinical judgment is required
Cons: Cost-prohibitive for large volumes of content, hard to scale, inconsistent
(humans make subjective choices!)
Hybrid—automated process followed by manual
review/modification
For high-value, specialized sites (such as clinical decision support, which requires “one best
answer” results) this extra human touch can be necessary
Some content types aren’t accessible to automated systems (multimedia)
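The automated and hybrid approaches can be sketched in a few lines of Python. This is a deliberately naive string matcher, not how any particular tagging product works; the concept IDs and the `autotag` and `review_queue` names are invented for illustration.

```python
import re

# Concept ID -> strings that should trigger the tag (IDs are made up).
CONCEPT_STRINGS = {
    "T001": ["newborn", "neonate"],
    "T002": ["gamma hydroxybutyrate", "ghb"],
    "T003": ["clostridium difficile", "c diff"],
}


def autotag(text):
    """Return {concept_id: [matched strings]} using whole-word matching."""
    lowered = text.lower()
    found = {}
    for concept_id, strings in CONCEPT_STRINGS.items():
        for s in strings:
            if re.search(r"\b" + re.escape(s) + r"\b", lowered):
                found.setdefault(concept_id, []).append(s)
    return found


def review_queue(tags, approved):
    """Hybrid step: list machine tags a human has not yet approved."""
    return sorted(cid for cid in tags if cid not in approved)


tags = autotag("GHB toxicity in a neonate")
print(tags)
print(review_queue(tags, approved={"T001"}))
# A missed concept: "neonatal" is not in the string list, so nothing matches.
print(autotag("neonatal care"))
```

The last call shows the "missed concepts" risk directly: the matcher only finds the strings it has been given, which is why a rich thesaurus and a human review step both matter.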
23. Tagging for different uses
[Figure: mock article page with Disease, Diagnosis, Treatment, and References sections, a table, and a figure. <collection1, collection2> tags sit at the top of the page, <summary> tags at each section, and angle-bracketed entity tags inline in the placeholder text.]
<Collections> What “buckets” does this content object belong in?
Assignment of content into topical collections for major site navigation or product definition
topic collections; microsites; virtual journals…
<Section Summaries> What is this section/article/chapter about?
Most significant topics discussed at the article/chapter/section (wrapper) level
answers to clinical questions; review; skills assessment…
<Entities> What is this thing?
Important concepts at the paragraph/list/table/figure (granular) level
complex search queries; concept overlap analysis; specific entity types like drugs, genes, clinical trials, manufacturers…
24. WHY
should you care
(What are the benefits?)
25. Failure of the status quo
Information scarcity is no longer the issue.
Attention scarcity is the problem.
The publisher’s role in information curation and
filtering has never been more important.
However, the tools for doing this are changing.
“Information is a source of learning. But unless
it is organized, processed, and available to the
right people in a format for decision making, it
is a burden, not a benefit.” –William Pollard, physicist
26. Search accuracy, precision
Faster, more accurate and reliable answers to
questions enhance user productivity and thus
improve your application’s usability and user
satisfaction ratings.
The accuracy threshold for STM information is very high!
Users increasingly will not tolerate ambiguous results.
Time-strapped users are struggling with information
overload—fewer, better answers are often preferred.
Tagging allows exposure of hard-to-find media like images,
videos.
27. “Which did you mean?” at McGraw-Hill’s
AccessMedicine
29. Pathways to related content
Related search terms
Links to related content within and across
resources
Dynamically generated as new content is
added
Goal: Increase serendipitous discovery, site
stickiness, and usage metrics like number of
page views and time on site
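One simple way to generate related-content links from tags is to rank other items by how much their tag sets overlap. The Python sketch below uses Jaccard similarity, which is an assumption chosen for illustration rather than the method behind any specific site; the document IDs and tags are made up.

```python
def jaccard(a, b):
    """Share of tags two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related(doc_id, doc_tags, limit=3):
    """Return IDs of the items whose tags overlap most with doc_id."""
    target = doc_tags[doc_id]
    scored = [(jaccard(target, tags), other)
              for other, tags in doc_tags.items() if other != doc_id]
    scored = [(score, other) for score, other in scored if score > 0]
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:limit]]


# Hypothetical tagged items across content types.
DOC_TAGS = {
    "article-1": {"asthma", "corticosteroids", "pediatrics"},
    "chapter-7": {"asthma", "corticosteroids"},
    "video-3": {"asthma", "spirometry"},
    "article-9": {"hypertension"},
}

print(related("article-1", DOC_TAGS))
```

Because the links are computed from tags rather than curated by hand, a newly tagged item shows up in the results the next time the function runs, with no editorial step.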
32. Contextual integrations
Internally—across titles and content types
(journals, books, videos, images, e-learning…)
Externally—with partners and external data sets
Increasingly important to integrate content into
customer workflows—to bring content to them
in context as they do their daily work
clinicians at point of care
students as they prepare for exams
33. New products
Content recycling: Create new products from
content you already have
Image collections
Mashup and micro products that serve specialized audiences and
fit specific workflows
Topically constructed objects like virtual journals, knowledge
environments, coursepacks, learning objects
You can cost-effectively create
niche products not possible before
35. Search engine optimization
Granular topic exposure leads to better
ranking in major search engines
Next wave of discovery tools (intelligent agents,
virtual research assistants) will give greater weight
to content they can understand
Tags can also be exposed to help create auto-
extracts for content that doesn’t have abstracts
(like book chapters)
37. Semantic users
As users search and navigate semantic content, you can attach the
tags on that content to them
A semantic profile for a user can be created from his/her site
activity
What topics are they interested in?
How are their interests evolving?
Use this information to create personalized information services
Excellent method for encouraging anonymous institutional users to
register/log in
Use topical affinities between users to create communities of
practice—groups of people who share a passion for something they
do and learn how to do it better through social interaction
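A semantic profile can be as simple as a running count of the tags on content a user has viewed. The Python sketch below assumes a hypothetical view log and tag table; all names and data are illustrative.

```python
from collections import Counter


def build_profile(viewed_doc_ids, doc_tags):
    """Count how often each tag appears on the content a user viewed."""
    profile = Counter()
    for doc_id in viewed_doc_ids:
        profile.update(doc_tags.get(doc_id, ()))
    return profile


def top_interests(profile, n=2):
    """Most frequent tags; ties broken alphabetically for stable output."""
    ranked = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:n]]


def shared_interests(profile_a, profile_b):
    """Tags two users have in common: a seed for a community of practice."""
    return set(profile_a) & set(profile_b)


# Hypothetical tags and view logs.
DOC_TAGS = {
    "a1": ["asthma", "pediatrics"],
    "a2": ["asthma"],
    "a3": ["asthma", "spirometry"],
    "a4": ["hypertension", "pediatrics"],
}
user_1 = build_profile(["a1", "a2", "a3"], DOC_TAGS)
user_2 = build_profile(["a4"], DOC_TAGS)

print(top_interests(user_1))
print(shared_interests(user_1, user_2))
```

Storing the counts with timestamps, rather than as one total, would be one way to answer the "how are their interests evolving?" question.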
38. Contextual advertising
Match article and ad semantic tags to precisely
target ads based on topic
OR, block ads from appearing next to articles on
related topics
OR (even better): Alternative advertising method
Advertising can be targeted to the user profile, not just the article
Avoid targeting editorially sensitive pages but still deliver ads
that match that user’s interests on neutral pages or alerts
For classified/job ad targeting, user interests can be matched up
with demographics like location
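The three targeting options above can be sketched as set operations on tags. This Python example is illustrative only: the ad records, tag values, and function names are invented, and a real ad server would apply many more rules.

```python
# Hypothetical ad inventory, each ad carrying semantic tags.
ADS = [
    {"id": "ad-statin", "tags": {"hyperlipidemia", "statins"}},
    {"id": "ad-inhaler", "tags": {"asthma"}},
]


def match_by_topic(ads, page_tags):
    """Option 1: show ads whose tags overlap the article's tags."""
    return [ad["id"] for ad in ads if ad["tags"] & page_tags]


def block_related(ads, page_tags):
    """Option 2: show only ads that do NOT overlap the article's tags."""
    return [ad["id"] for ad in ads if not (ad["tags"] & page_tags)]


def match_by_profile(ads, user_interests, page_is_sensitive):
    """Option 3: target the user profile, but only on neutral pages."""
    if page_is_sensitive:
        return []
    return [ad["id"] for ad in ads if ad["tags"] & user_interests]


article_tags = {"asthma", "pediatrics"}
print(match_by_topic(ADS, article_tags))
print(block_related(ADS, article_tags))
print(match_by_profile(ADS, {"statins"}, page_is_sensitive=False))
```

In the third option the article's own tags never enter the decision, which is what keeps ad placement independent of editorial content.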
39. What about mobile?
Reduction in number of
clicks!
Precision in search
Quick links to what
most users need
Targeted navigation that
leads to the most
important content (answers
to clinical questions)
41. Questions for you and your
application/hosting providers
What are your user stories/use cases?
What are the business benefits/ROI for your
organization?
What content do you need to tag, how is that content
delivered, and can those delivery systems/platforms use
taxonomy and tagging in a way that supports your user
needs?
What’s your plan for keeping your taxonomy up to date?
Can your “living” taxonomy be integrated into your
applications? In real time as you make updates?
42. Questions for semantic tech
providers
Does the technology support your user stories/
use cases?
Does it offer/integrate with a constantly evolving
taxonomy?
Does it meet the accuracy threshold for your users
and your content?
Can it tag at the depth—both granular and summary
level—necessary? Figures and tables? Top-level
collections?
43. The semantic user story
I am specifically identifying ______
because ______ is very important
to my ______ users
when they are ______.
44. The semantic user story
I am specifically identifying concise disease
treatment content because immediate access to
treatment options is very important to my
emergency physician users when they are seeing
20 patients an hour.
46. The semantic user story
I am specifically identifying skin disorder images
on all body locations and all types of skin because
visual diagnosis is very important to my family
physician users.
48. What are your user
stories?
Problems/needs to solve for your users
Delivering top quality care under serious time constraints
Explosion of new research to keep up with and integrate into
practice
Need to pass a licensing exam
Problems/needs to solve for your
organization
Creating new products that grow and diversify revenue
Creating more value from advertising
Gaining insight into users
49. Thank you!
Pam Harley
VP, Product & Market Development
Semedica™, a division of Silverchair
pamh@semedica.com
(434) 296-6333 x372
www.silverchair.com
www.semedica.com
“Organizing is what you do before you do something, so
that when you do it, it is not all mixed up.” –A. A. Milne