12 Things the Semantic Web Should Know about Content Analytics
Seth Grimes, Alta Plana Corporation
June 2011 | Sponsored by OpenText
Abstract
Content analytics is sense-making technology. It semanticizes online, social, and
enterprise content. It facilitates semantic data integration, search, and information
management and is an underappreciated foundational technology for building the
Semantic Web. Technologists and business leaders alike will benefit from understanding
the role content analytics plays in semantic computing, starting with 12 essential points.
Contents
Introduction
The Semantic Web and Content Analytics
1. Entity extraction is a form of content analytics
2. There are more entities than are dreamt of in DBpedia, Freebase, Reuters.com, and the like
3. Content analytics discovers, annotates, and extracts the broad range of information in content, far beyond entities
4. Content analytics handles subjectivity: Sentiment, opinion, and emotion
5. Content covers more than just text managed in a content management system and published to the web
6. Content analytics is part of a collection of complementary and overlapping analytical technologies
7. Content analytics generates semantic and structural metadata
8. Content analytics facilitates semantic search and semantic data integration
9. Content analytics scales from individual messages to wide data spaces and large corpora
10. Content analytics can operate in real time for a wide variety of business goals and business domains
11. Content analytics is delivered installed, on the cloud, and as-a-service: Your choice
12. Content analytics can be customized, extended, and configured via inclusion of controlled vocabularies, taxonomies, and ontologies
Conclusion
Introduction
Semantic computing exploits machine-represented meaning to enhance search, data
integration, knowledge management, and information-centered business processes. The
ultimate goal is to enable automated knowledge discovery and business-process
execution across a linked data web. However, this goal will not be reachable in any
meaningful sense unless and until a broad set of information-rich endpoints is available
for major business and personal purposes. These Semantic Web endpoints – triple
stores that capture entities and relationships, supporting distributed query and inference –
and other forms of semantically annotated content aren’t instantiated and populated by
some magical process. They must be created.
The creation of meaning – the generation of structured information from “unstructured”
sources – is the province of content analytics. Content analytics, modern applications
that couple content production with annotation, and efforts to map databases into
linked-data repositories are the foundational technologies that facilitate semantic
computing and populate the Semantic Web.
So long as the Semantic Web lacks a critical mass of usable data from online, social, and
enterprise sources, it will have form but not function. The set of core Semantic Web
technologies, a stack of standards and protocols, is not enough on its own. The Semantic
Web and broader semantic computing need data, yet almost no historical information,
and very little of the information being produced today, is in semantic formats. Content
analytics can extract semantics from that mass of “unstructured” information to provide
semantic structure. By semanticizing the range of existing content, content analytics can
and will fuel the realization of the Semantic Web.
The Semantic Web and Content Analytics
Despite its very important (and as yet mostly potential) Semantic Web role, and despite
the business value content analytics delivers today, the technology, solutions, and
broader applications are not sufficiently well understood; hence this paper, 12 Things
the Semantic Web (and semantic computing practitioners) Should Know about Content
Analytics. Let us start with a fundamental point:
1. Entity extraction is a form of content analytics
Entities are concrete things, often named in some form of lexicon; for example, people
(Thor, Barack Obama), companies (IBM, General Motors), places (Paris, Canada), events
(the World Series), enzymes (hexokinase), and even research papers (“The
Unreasonable Effectiveness of Data”). Entity extraction is a process that starts by finding
entities in source materials, whether web pages, email, audio streams, images, or some
other material of interest. Once discerned, the entity is disambiguated (Is “Ford” a car, an
industrial company, an actor [which?], a theater, or a place to cross a river?). Then it is
typed (Person, Organization, etc.), and (perhaps) mapped into a canonical form
according to a controlled vocabulary. It may be designated with a uniform resource
identifier (URI) that facilitates associating diverse information with the source material.
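To make those stages concrete, here is a minimal Python sketch (an illustration only, not how any particular product works) that finds, types, canonicalizes, and assigns URIs to entities using a tiny hand-built gazetteer. The gazetteer entries, the matching rule, and the example DBpedia-style URIs are assumptions standing in for the lexicons, disambiguation models, and vocabularies a real content analytics engine would apply.

import re

# Hypothetical gazetteer: surface form -> (type, canonical form, URI).
# Real systems combine large lexicons with statistical disambiguation.
GAZETTEER = {
    "Barack Obama": ("Person", "Barack Obama",
                     "http://dbpedia.org/resource/Barack_Obama"),
    "IBM": ("Organization", "International Business Machines",
            "http://dbpedia.org/resource/IBM"),
    "Paris": ("Place", "Paris, France",
              "http://dbpedia.org/resource/Paris"),
}

def extract_entities(text):
    """Return (surface form, type, canonical form, URI) for each gazetteer hit."""
    hits = []
    for surface, (etype, canonical, uri) in GAZETTEER.items():
        # Word-boundary matching is a crude stand-in for tokenization and
        # context-based disambiguation.
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            hits.append((surface, etype, canonical, uri))
    return hits

if __name__ == "__main__":
    for hit in extract_entities("Barack Obama met IBM executives in Paris."):
        print(hit)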
Entity extraction is a form of content analysis. It involves reaching into the content,
whatever its form, and understanding the inherent structure that is apparent to any
educated human reader: the “chunks” into which text and other content is separated, the
word morphology, grammar, and larger-scale structure that humans grasp without
conscious reflection. The parsing steps may seem simple, but tasks such as
disambiguation, which entails consideration of context and usage, decidedly are not.
Vikings in a sports article are different from Vikings in a history text; beyond document
type, the word sequence “the Vikings lost their fourth straight game” tells us which sense
of Vikings is in play. Yet –
2. There are more entities than are dreamt of in DBpedia,
Freebase, Reuters.com, and the like
Common entity sources do not cover all business, scientific, news, or cultural domains.
An entity annotation service designed foremost for financial news sources won’t help you
much with laboratory science or understanding Iraqi Arabic blog chatter.
Content analytics tools support a variety of techniques that allow you to go beyond the
common sources. Tools may allow you to import and apply your own lexicons and
taxonomies, and they may infer new entities via syntactic analysis and machine learning
(techniques that decode grammar and apply pattern analyses to build or expand on a list
of features of interest). Further, content analytics may resolve anaphora – pronouns as
well as other forms of co-reference – recognizing different ways of referring to a
single thing. The application of natural-language processing helps us understand that in
the text –
“Sarkozy's desire to become the new President's main international partner – and,
indeed, personal friend – was palpable. Consequently, the famously passionate and
emotive Frenchman responded to Obama's reserved personality…”
– “the new President” is Obama and “the famously passionate and emotive Frenchman”
is Sarkozy. But entities are not all that content analytics can find.
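Before moving on: as a toy illustration of the reference resolution just described, the sketch below maps descriptive phrases to canonical entities with a hand-built alias table. The table itself is an assumption; real content analytics infers such links from grammar and context rather than from a fixed list.

# Hypothetical alias table standing in for true co-reference resolution.
ALIASES = {
    "the new President": "Barack Obama",
    "the famously passionate and emotive Frenchman": "Nicolas Sarkozy",
}

def resolve_aliases(text):
    """Replace known descriptive references with canonical entity names."""
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return text

passage = ("Consequently, the famously passionate and emotive Frenchman "
           "responded to Obama's reserved personality.")
print(resolve_aliases(passage))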
3. Content analytics discovers, annotates, and extracts the
broad range of information in content, far beyond entities
RDF schemas capture relationships among entities: FriendOf, EmployedBy, OwnerOf,
and so on; the lists are long, varying by data space. Entity relationships may be
engineered in a top-down, prescriptive manner, or they may be mapped from sources
such as relational databases that capture relationships. Wherever they originate,
relationships are the key to knowledge and the raw material for inference.
If your approach is to extract entities and restrict yourself to relationships expressed in
ontologies or other knowledge repositories, you may be leaving vast amounts of valuable
information unanalyzed. Source materials capture and express relationships. After all, a
blog posting, a tweet, an article, an e-mail message, a video: every form of content was
created to communicate. It would be silly to parse a news article and report that country
X, person Y, and company Z were mentioned without also extracting the entity
relationships present in the text.
Content may contain conventional data, and not just in marked-up data tables. Consider
a sentence from a datelined article,
“The Dow Jones Industrial Average finished the trading day at 12,605.32, up 45.14
points (0.36 percent). The S&P 500 closed at 1,343.6, up 2.92 points (0.22 percent).”
Content analytics can extract this data, to RDF or to a database table, along with
metadata such as the names of the article author and publication, the publication date,
and the article’s URL, as well as other available information from HTML META tags and
page-embedded FOAF, RDFa, or other microformat markup. Content analytics can infer from the text –
“Among actively traded Colorado stocks, Accelr8 Technology Corp. (AXK)...”
– that (possibly) named entities Accelr8 Technology Corp., AXK, and Colorado are
related; sophisticated content analytics will ascribe the ticker symbol AXK to Accelr8 and
capture that Accelr8 is located in the geographic area Colorado. Beyond these facts and
relationships, strong content analytics will associate the conceptual class “stock market
index” with the DJIA and S&P 500 and will identify topics such as “financial markets
reporting” and themes such as “the economy” with the source article.
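As an illustration of pulling conventional data out of such text, the sketch below applies a single regular expression to the market-report sentences quoted above and emits simple subject/predicate/object statements ready for a table or an RDF store. The pattern and the predicate names are assumptions; a production extractor would rely on linguistic parsing rather than one regular expression.

import re

# Hypothetical pattern for sentences of the form
# "<index> finished the trading day/closed at <value>, up <points> points (<pct> percent)".
PATTERN = re.compile(
    r"(?P<index>[A-Z][\w& ]+?) (?:finished the trading day|closed) at "
    r"(?P<close>[\d,\.]+), up (?P<change>[\d\.]+) points "
    r"\((?P<pct>[\d\.]+) percent\)"
)

text = ("The Dow Jones Industrial Average finished the trading day at 12,605.32, "
        "up 45.14 points (0.36 percent). The S&P 500 closed at 1,343.6, "
        "up 2.92 points (0.22 percent).")

for m in PATTERN.finditer(text):
    subject = m.group("index").strip()
    # Emit simple statements; the predicate names are invented for illustration.
    print((subject, "closingValue", m.group("close").replace(",", "")))
    print((subject, "pointChange", m.group("change")))
    print((subject, "percentChange", m.group("pct")))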
How far beyond entities?
4. Content analytics handles subjectivity: Sentiment,
opinion, and emotion
We can classify information as factual or as subjective. Attitudinal information –
sentiment, opinions, emotions – is very important to business applications that include
customer service and support, marketing, product and service quality, contextual
advertising placement, and policy and politics. A business that is listening will pick up on
tweets such as –
@robwolfeusa Wow, at #Hilton in Long Island. Exec floor room guaranteed not
available and no rooms clean and available at 4:30PM.
– that indicate problems. Content analytics, in this instance, will understand what hotel
property is being referred to, what the issue was, and who was posting (the potential
often exists to match a social handle to a name or other identifying information and from
there to actual business transactions); this facilitates processing and quick responses.
This example looks at and matches individual records; content analytics is also applied to
aggregate sentiment, classified by familiar categories such as location, age, and sex as
well as by company-specific dimensions such as product and location.
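As a deliberately naive illustration of sentiment scoring, the sketch below counts positive and negative words from tiny hand-built lists, which are assumptions chosen for this one example. Commercial tools use far richer linguistic models (negation handling, domain lexicons, sarcasm cues); notice how “Wow” and “clean” mislead this naive approach, which is exactly why context sensitivity matters.

# Hypothetical word lists; real sentiment lexicons are far larger and weighted.
NEGATIVE = {"not", "no", "never", "dirty", "unavailable", "problem"}
POSITIVE = {"great", "clean", "friendly", "wow"}

def score(text):
    """Return a crude polarity score: positive minus negative word counts."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweet = ("Wow, at #Hilton in Long Island. Exec floor room guaranteed not "
         "available and no rooms clean and available at 4:30PM.")
print(score(tweet))  # prints 0: "Wow" and "clean" offset "not" and "no" here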
This class of subjectivity analysis looks for the voice of the customer (or prospect,
influencer, voter, patient, or market) as expressed online in blogs, forum postings,
reviews, email, surveys, contact-center conversations, and a range of other feedback
sources. It is sensitive to the identity of the person who is posting, to the needs of the
person who may be consuming the information, to context, and to plans or intent captured
in text. While subjective information cannot always be matched to particular persons, the
benefits of knowing who is posting are prompting entity-analytics R&D into identity
resolution based on clues found in text.
Our next point should be obvious by now:
5. Content covers more than just text managed in a content
management system and published to the web
We have user-generated content online in the form of articles, blogs and comments,
status updates, profiles, and forum postings. And certainly, we have content in the
conventional sense, material that is created and published via formal, managed
processes. But the content label also extends to email, corporate documents and
reports; SMS/IM text, contact-center notes and transcripts; and also, as mentioned, to
audio streams, images, and video. This includes the above in original, as-created form
and in derived (duplicated, quoted, sampled, distorted, and otherwise reworked) forms.
Consider rich-media content in particular. Content analytics solutions are already in use
to search, analyze, and mine audio streams for contact-center applications, able to
search not only speech transcribed to text but also phonemes, the fragments from which
speech is composed, with advanced abilities to distinguish among speakers in a
conversation and to detect emotion. A consumer-grade camera’s ability to identify people
within the photo frame and to detect whether a subject is smiling or blinking is content
analytics; automated image-recognition capabilities, and not just via externally applied
tags, are advancing rapidly, as is the ability to decode image changes in a video stream.
Content analytics, whether coupled with (other) Semantic Web technology or operating
independently, can be applied to the spectrum of information types across organizational
barriers. Analytics, broadly drawn, provides the key.
6. Content analytics is part of a collection of
complementary and overlapping analytical technologies
Analytics is the search for business insight in online, social, and enterprise data.
Analytics comes in many forms, under a variety of names. The definition common to
them all is that analytics transforms source data to derive business information that is
stored in databases and communicated in the form of numbers, tables, charts, and
visualizations.
Data mining discerns patterns in structured data, typically in databases, to produce
predictive models suitable for classification, forecasting, and other functions. BI
typically applies dimensional models to data and supports reporting and interactive data
analysis, but it may also include predictive-model deployment and, in some instances,
will subsume the data-mining process. Web analytics is not typically grouped under the
BI umbrella, but it is BI, drawing from web server log files to mine behavior patterns from
click-stream data, presented in familiar BI dashboards, reports, and charts and feeding
data-mining processes that seek to model quantities such as website conversion (a fancy
name for sales) and shopping-cart or session abandonment. Social-network analysis
looks at the dynamic graph of connections and message propagation across social and
enterprise platforms. Lastly, location intelligence is a special sort of BI with data types,
structures, analysis, and presentation methods tailored for geospatial data.
These analytics variants operate on numerical, quantified data. Content analytics
complements them, in some cases by extracting data from textual sources (e.g.,
geographic locations and numbers from data tables) and in other cases by using their
capabilities for exploratory analysis of text-sourced information; for instance, when
results are classified by geographic source or topic and rendered in a map, presented in
BI dashboards and charts, or incorporated in predictive securities-trading models.
But content analytics can do more than just quantify free-form sources, as shown in our
next two points.
7. Content analytics generates semantic and structural
metadata
Metadata is descriptive information. If content is compared to a letter, the writing and
postmark on the envelope are metadata. Consider electronic examples: the values of the
To, From, CC, Subject, and routing header fields of an email message; the author, file
name, file type, last-saved date, title, language, and tags applied to a document; values
annotated with web page META tags; and so on. Some of this metadata is structural,
some of it is semantic.
The Dublin Core Metadata Initiative (http://dublincore.org/metadata-basics/) is perhaps
the most prominent metadata-standards proponent, providing natural-language and
formal semantic shared vocabularies that facilitate interoperability. The natural-language
processing (NLP) components of content analytics solutions can and do discern and
extract metadata from free-form and semi-structured source materials, with the
possibility of Dublin Core conformance and of meeting particular, situational needs by
extracting advanced metadata such as topics and themes.
Content analytics tools will, depending on the provider and on the user’s needs, create
and store an XML-, RDF-, or FOAF-annotated version of source materials, extract
information of interest to a file or database, or, when invoked as-a-service, return XML-,
JSON-, or similarly marked-up results.
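As a sketch of the kind of as-a-service output just described, the snippet below serializes extracted metadata as JSON, reusing Dublin Core element names where they apply. The article title, author, and other values are invented for illustration and are not drawn from any real service's response format.

import json

metadata = {
    "dc:title": "Markets close higher on earnings news",  # hypothetical article
    "dc:creator": "Jane Reporter",                         # hypothetical author
    "dc:date": "2011-05-17",
    "dc:language": "en",
    "entities": [
        {"text": "Dow Jones Industrial Average", "type": "StockMarketIndex"},
        {"text": "S&P 500", "type": "StockMarketIndex"},
    ],
    "topics": ["financial markets reporting"],
    "themes": ["the economy"],
}

# A service invocation would return a document like this; here we just print it.
print(json.dumps(metadata, indent=2))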
Here’s where we come to search and linking.
8. Content analytics facilitates semantic search and
semantic data integration
Web pages annotated with concepts, topics, synonyms, etc., and with key information
content micro-formatted – this is Search Engine Optimization (SEO) – will be more directly
accessible as search evolves into information access. For both web search and local
enterprise search, that extracted information can be indexed as the basis for concept and
faceted search (which are two varieties of semantic search) and for faceted navigation,
where users and site visitors see results classified into high-level categories known as
facets (facets may be predetermined or they may have been discovered in source
materials via NLP and clustering).
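To illustrate faceted navigation in miniature, the sketch below counts facet values across a handful of documents that have already been annotated; the documents and facet names are invented. A search interface would display these counts as clickable filters.

from collections import Counter, defaultdict

# Hypothetical documents already tagged by a content analytics step.
documents = [
    {"title": "Markets rally", "topic": "financial markets", "region": "US"},
    {"title": "Hotel openings", "topic": "hospitality", "region": "EU"},
    {"title": "Bank earnings up", "topic": "financial markets", "region": "EU"},
]

def facet_counts(docs, facets=("topic", "region")):
    """Count the distinct values of each facet across the document set."""
    counts = defaultdict(Counter)
    for doc in docs:
        for facet in facets:
            if facet in doc:
                counts[facet][doc[facet]] += 1
    return counts

for facet, values in facet_counts(documents).items():
    print(facet, dict(values))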
Content analytics also enables similarity search, where we can search for documents,
messages, or objects that are statistically or semantically similar to one we’re viewing,
and for similar searches, which are search queries similar to the one we have issued.
Similarity measurement is useful beyond interactive search; for instance, for tracking the
diffusion of content – messages, press releases, quotations, and so on – across news,
social, and interpersonal messages, whether for media measurement, copyright
enforcement, or research. Given content’s complexity, content analytics’ ability to
“fingerprint” content and measure similarity is an asset in tracking efforts.
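As a minimal sketch of that fingerprint-and-compare idea, the snippet below reduces documents to term-frequency vectors and scores them with cosine similarity. The example texts are invented; real systems add tokenization, stemming, weighting such as TF-IDF, and an index so comparisons scale.

import math
from collections import Counter

def fingerprint(text):
    """Crude term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

press_release = "Acme launches a new content analytics service for semantic search."
news_story = "A new semantic search service from Acme applies content analytics."
unrelated = "The recipe calls for two cups of flour and a pinch of salt."

print(cosine(fingerprint(press_release), fingerprint(news_story)))  # relatively high
print(cosine(fingerprint(press_release), fingerprint(unrelated)))   # much lower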
Lastly, while annotation is great for SEO and semantic search, it also facilitates data
integration, also known as data fusion and record linkage. For Semantic Web
applications, annotations would include URIs; for other applications, integration could be
accomplished via other content-extracted key information.
Automatic summarization and abstracting are under the content analytics umbrella.
9. Content analytics scales from individual messages to
wide data spaces and large corpora
Content analytics scales through the use of high-throughput technologies such as
Hadoop and deployment on grid-based, scalable hardware. Further –
10. Content analytics can operate in real time for a wide
variety of business goals and business domains
The choice of particular techniques and tools, where scalability, the need for speed, and
other capabilities are concerned, will depend on the information sources, the business
goals, the type of insights sought, and the skills of the users. If the business need is
real-time news and social monitoring for brand and reputation management, security, or
military intelligence, one class of solution will be in order, and it will be very different in
application from a solution chosen to provide semantic search and navigation for an
online commerce site.
Focusing on real-time capabilities and also the ability to handle noisy social text (replete
with slang, idiom, misspellings, abbreviations, sarcasm, and the like), we see that content
analytics’ capabilities are a neat complement to the structured Semantic Web, which
would be hard-pressed to keep up with today’s flood of raw, chaotic information. The
pairing of structured sources and ad-hoc analyses can be especially powerful.
11. Content analytics is delivered installed, on the cloud,
and as-a-service: Your choice
Most members of the semantics community are familiar with a few as-a-service
annotation services, accessible via web services APIs. They represent only the visible
top of a much larger, metaphorical content analytics iceberg. First, the content analytics
world offers many more annotation services, with capabilities that extend far beyond
English-language entity analytics to encompass deep information extraction. The only
barrier to their use in the semantics world and on the Semantic Web is a lack of awareness.
Further, content analytics is available on the cloud, in hosted form, or may be installed on
your own hardware.
12. Content analytics can be customized, extended, and
configured via inclusion of controlled vocabularies,
taxonomies, and ontologies
Analytics means flexibility: the ability to square formal methods and structures with ad-
hoc, situational needs, and to rely both on shared, standardized resources and protocols
and on proprietary assets and materials not yet brought into compliance with modern
forms and into the Semantic Web.
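As a closing sketch of that configurability, the snippet below classifies text against a small hand-built taxonomy, a mapping from category to trigger terms. The taxonomy here is an assumption for illustration; real deployments import controlled vocabularies, taxonomies, or ontologies maintained by the organization.

# Hypothetical two-category taxonomy; production taxonomies are far richer.
TAXONOMY = {
    "Financial Markets": {"stock", "stocks", "index", "ticker", "trading", "shares"},
    "Hospitality": {"hotel", "room", "guest", "check-in", "reservation"},
}

def classify(text):
    """Return the taxonomy categories whose trigger terms appear in the text."""
    tokens = {w.strip(".,!?").lower() for w in text.split()}
    return [cat for cat, terms in TAXONOMY.items() if tokens & terms]

print(classify("Exec floor room guaranteed not available at the hotel."))
print(classify("Among actively traded Colorado stocks, shares rose."))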
Conclusion
We have examined 12 Things the Semantic Web (and Semantic Computing Practitioners)
Should Know about Content Analytics. But really, they reduce to a single paragraph:
Content analytics makes sense of the mess of content – of online, social, and enterprise
text, and moving forward, of rich media including images, audio, and video – for purposes
that extend to semantic data integration, search, and information management. Content
analytics, by helping semanticize existing data, is a foundational technology for the
Semantic Web and semantic computing. Content analytics is delivering business value
today, complementing BI, web analytics, location intelligence, and predictive analytics.
Prospective users can look to a variety of technologies and tools to find or craft a solution
that best meets particular needs, whether for individual, embedded, or enterprise use.
Given that hosted and as-a-service (as well as installed) options are available, getting
started is not difficult; given the breadth of capabilities, standards adherence, and
customizability, there are few adoption barriers. Semantics practitioners will readily see
the value of the technology and will find it well worth trying.