Zemanta - Blog me up!
Faculty of Computer Science, Iasi, Romania
Abstract. There is a growing need to help authors publish their content
online, but there are still unsolved issues in enriching that content and
making it more readable, discoverable, and interconnected.
Zemanta is a tool for real-time content understanding and recommendation,
widely used by bloggers because its relevant suggestions make
blogging more fun. According to the company's estimates, it takes only 30 seconds
to publish an article with Zemanta's assistance.
Keywords: Zemanta, content recommendation, social web software
Making your work known and visible to the world is usually a long-term process,
given the long-winded procedures behind it. It is not just a matter of content, but
of how you present it and relate it to previous work in the field. Finding the
appropriate references and definitions, as well as suggestive images and tags to
describe your article, is time-consuming. Besides, this usually requires skills and
in-depth knowledge of specifications and standards. What if there were an automatic
way of dealing with all these activities, leaving you more time for your research and
the creative process? A solution comes from Zemanta Ltd., which provides a free
tool for content understanding and recommendation in real time (while you are
writing your post). It performs a semantic analysis of the input text and returns
related content, pictures, links, and tags. It is then the author's job to choose
what is needed from the content recommended by the Zemanta engine. The
result is not only an attractive, user-friendly design, but also an efficient relation
between your content and external data (that computers can understand).
The target group consists of social web software users (e.g. blog networks,
blog farms), professional publishers, and people with programming skills. This
article focuses on bloggers and how Zemanta presents different types of
Zemanta can be downloaded from its website in several forms: as a
browser extension (for Firefox 2 and 3, Internet Explorer 7 and 8, Chrome, Safari etc.),
as a server-side plug-in (for WordPress, Blogger, Joomla etc.), and as an API for
developers.
2 Magdalena Jitca
2. Basic principles and features
1. Recommendation of related content
Zemanta searches the content pool for related articles, links, and images, and proposes
the most relevant ones to the author. They show up in a special sidebar which
keeps updating itself while you edit your manuscript. The recommendations
are more accurate for entries longer than 300 words. It is then up to you to decide
whether the suggestions are what you expected and how correct they are. Zemanta
is English-only for now, but I have obtained good results when writing in another
language if the content involves trademarked items or well-known
unique entities (personalities, places, companies etc.). The comparison is
presented in section 4.
2. Large-scale knowledge database
Zemanta draws on more than 10,000 news sources, connecting 100 million content
objects. The articles it suggests come from hundreds of top media sources on the web,
as well as from the social networks and other blogs of Zemanta users. The images
suggested come from Wikimedia Commons, Flickr, and stock photo providers like
Shutterstock and Fotolia. Zemanta, like many other applications, uses Wikipedia as a
kind of expert system. For example, if a page is linked to from a Wikipedia page, it is
safe to assume that the page is relevant to the topic of that Wikipedia page. This kind
of approach can be used for many different tasks, all with the goal of making the web
and web services smarter.
3. Tagging
Tagging is not an easy task, but doing it right helps the web grow smarter by marking
up pages, posts, videos, images, and other objects available on the web. Zemanta can
automatically tag the content with general categories, from which you can choose the
ones you need. However, it seems that humans are still more efficient than computers
at tagging, which is why Zemanta lets its users customize the
categorization. Besides the benefit of making tagging suitable for your own needs, the
choices you make are further used by Zemanta for refining the recommendation
4. Uploading Custom Content
A client can help improve the content pool by drawing on his own experience and
previous work. He can upload RSS feeds, which are later indexed by Zemanta
Enterprise and thus included in the content pool. The only condition is that the
content be either original or properly licensed.
5. Customizing the Recommendation Pool
It is also possible to select the content to be included in the recommendations you get
from Zemanta (e.g. limiting links to your own network and trusted sources). Because it is
so well integrated into the blogging platforms, Zemanta offers personalization targeted at
bloggers. They can define their own blog sources to browse and import their
Twitter/Facebook/MyBlogLog contacts for automatic recognition while writing.
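Limiting links to an author's own network and trusted sources amounts to filtering
suggestions by their origin. The following is a minimal sketch of that idea, not
Zemanta's implementation: the function name and the representation of suggestions as
plain URLs are assumptions made for the example.

```python
from urllib.parse import urlparse


def restrict_to_trusted(urls, trusted_domains):
    """Keep only URLs hosted on a trusted domain or one of its subdomains."""
    trusted = {domain.lower() for domain in trusted_domains}
    kept = []
    for url in urls:
        # urlparse(...).hostname is already lowercased; it is None for bad URLs.
        host = urlparse(url).hostname or ""
        if host in trusted or any(host.endswith("." + d) for d in trusted):
            kept.append(url)
    return kept


# Hypothetical suggestions and a one-entry allowlist.
suggested = [
    "http://blog.example.org/post-a",
    "http://other.example.com/post-b",
    "http://example.org/post-c",
]
own_network = restrict_to_trusted(suggested, ["example.org"])
```

Matching on the host name rather than on the full URL string avoids accepting a
link merely because a trusted domain appears somewhere in its path.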
6. Copyright filtering
Zemanta also pays close attention to copyright legislation, making sure that
suggested content is licensed under Creative Commons or approved by third parties, so
that users do not run into problems when using Zemanta's service. It is mostly
images that you have to pay attention to, because tags, for example, are generally not
regarded as creative work and therefore are not protected under copyright terms of
Zemanta offers cross-platform quoting for blogs, using two techniques
to obtain the raw body of the post to be quoted: one
relies on the HTTP referrer, the other on a “request id” that is passed as
part of the URL. For example, you can have your finished article submitted to a blog
of your choice (by supplying the username and the password).
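The two ways of identifying the post to be quoted can be sketched as below. This is
an illustration only: the query parameter name `request_id`, the function name, and
the choice to prefer the request id over the referrer are assumptions, since the
article does not describe the actual URL layout.

```python
from typing import Mapping, Optional, Tuple
from urllib.parse import parse_qs, urlparse


def identify_quoted_post(url: str, headers: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (technique, value) naming the post to quote, or None."""
    # Technique 1 (assumed layout): a request id carried in the query string.
    query = parse_qs(urlparse(url).query)
    if "request_id" in query:
        return ("request_id", query["request_id"][0])
    # Technique 2: fall back to the HTTP referrer (the header is spelled "Referer").
    referrer = headers.get("Referer")
    if referrer:
        return ("referrer", referrer)
    return None


by_id = identify_quoted_post("http://blog.example.org/quote?request_id=abc123", {})
by_referrer = identify_quoted_post(
    "http://blog.example.org/quote",
    {"Referer": "http://other.example.org/post/1"},
)
```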
3. Architectural details
3.1 Zemanta system architecture
From the architectural point of view, the Zemanta web service is a server that stores the
content received from the application and, when requested, sends suggestions of
related content. This communication is based on the HTTP protocol and uses
the standard JSON and XML response formats. Authoring applications such as
content management systems then present the suggested content to the author, who
can select the appropriate information to merge into the manuscript. Fig. 1 depicts
the flow of the manuscript authoring process and shows how the Zemanta
server works. It is important to mention that the Zemanta service is no longer
involved after the authored work is posted, for example when the content is being read
by other users.
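The exchange between an authoring application and the server can be sketched as
follows. This is a minimal illustration under stated assumptions, not Zemanta's
documented API: the endpoint URL, the parameter names (`api_key`, `text`, `format`),
and the keys of the JSON response are all hypothetical. The request is only
prepared, never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: the real service URL is not given in this article.
ENDPOINT = "https://suggest.example.invalid/rest"


def build_request(api_key: str, text: str, fmt: str = "json") -> Request:
    """Prepare (but do not send) an HTTP POST carrying the manuscript text."""
    if fmt not in ("json", "xml"):
        raise ValueError("response format must be 'json' or 'xml'")
    # Parameter names are assumptions made for this sketch.
    body = urlencode({"api_key": api_key, "text": text, "format": fmt})
    return Request(ENDPOINT, data=body.encode("utf-8"), method="POST")


def parse_suggestions(raw_json: str) -> dict:
    """Group a JSON response into the four suggestion types (assumed keys)."""
    data = json.loads(raw_json)
    kinds = ("images", "articles", "links", "tags")
    return {kind: data.get(kind, []) for kind in kinds}


# Hand-written response standing in for what the server might return.
SAMPLE_RESPONSE = '{"articles": [{"title": "Example post"}], "tags": ["blogging"]}'

request = build_request("my-key", "Draft of my blog post")
suggestions = parse_suggestions(SAMPLE_RESPONSE)
```

An authoring application would repeat this exchange as the manuscript changes and
show the grouped suggestions to the author, who decides what to merge.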
Fig. 1. The flow of the authoring process with Zemanta
The entities involved in this process can be split into 5 categories, although a
single person may perform the actions corresponding to several
entities. Their roles and the way they interact are depicted in Fig. 2
and described below.
Fig. 2. The distribution of the roles in the Zemanta system
• Author - the person who composes the content and improves it with Zemanta’s
• CMS creator - the person or organization developing content management
software that integrates the services and experience offered by Zemanta
• Platform owner - the person or organization owning the specific hosting platform on
which the CMS software runs
• Reader - a person or organization benefiting from the content
• Zemanta - the service provider during the content creation process
An example where roles overlap is the case of applications developed on top of the
Zemanta API. If a programmer works as a software developer (for the CMS), but also
runs the application and then creates content with it, he is playing three different
roles. This is the case of enthusiast developers experimenting with Zemanta, whose
results can be seen and tested in .
Let us take a step back to understand how Zemanta’s content recommendation engine
works. Instead of running keyword-based queries (as traditional search engines do), it
analyzes the whole text by means of different natural language processing techniques
and builds a deep understanding of the content. On this basis it identifies the
concepts in the text by connecting them to a semantic database (e.g. DBpedia,
MusicBrainz) and delivers the suggested related results. In section 4, an interesting
application based on Zemanta, DBpedia and Freebase is presented, in order to
give a view of the internal representation of the content recommendation engine. It is
this unique combination of NLP, machine learning, and fine tuning that makes it work.
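Connecting concepts in a text to a semantic database can be illustrated with a toy
lookup. The sketch below is deliberately naive and is not Zemanta's method: a
hand-made dictionary stands in for the semantic database, and exact word matching
stands in for the NLP pipeline. The table contents and the function name are
assumptions made for the example.

```python
import re

# Tiny hand-made stand-in for a semantic database, keyed by lowercase name.
CONCEPTS = {
    "ljubljana": "http://dbpedia.org/resource/Ljubljana",
    "slovenia": "http://dbpedia.org/resource/Slovenia",
}


def link_concepts(text):
    """Return the concept URIs whose names occur as words in the text."""
    found = []
    for word in re.findall(r"\w+", text.lower()):
        uri = CONCEPTS.get(word)
        if uri and uri not in found:  # keep first-seen order, no duplicates
            found.append(uri)
    return found


english = link_concepts("A weekend trip to Ljubljana, Slovenia.")
romanian = link_concepts("Ljubljana este capitala Sloveniei.")
```

In this toy version the Romanian sentence still matches the entity whose name is
spelled the same in both languages, but misses the inflected form "Sloveniei".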
3.2 Suggestions in detail
As previously stated, Zemanta provides four types of content recommendations,
which are discussed in this section. They are all shown in Fig. 3, a screenshot
taken while composing a blog entry.
3.2.1 Images
Zemanta uses multiple sources as a basis for image suggestions.
Among the most widely used are Wikipedia, Getty and Flickr, but stock image
providers are also a good choice, as they provide images of higher aesthetic quality.
Because Zemanta retrieves Flickr images through the Flickr API, it cannot take
advantage of its internal concept representation for them. This means that less
topically accurate images may be suggested even when better ones exist. Each image
suggestion includes a “description” attribute. This is only a textual description of what
the image represents, and it may be inaccurate in some cases, especially when the
image has been poorly or wrongly tagged at its source. The image also includes an
“attribution” feature, which describes the source and the author of the image when
those are available.
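An authoring application might handle these two attributes as sketched below. The
“description” and “attribution” fields come from the text above; the `url` field,
the class name, and the caption format are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageSuggestion:
    url: str                            # hypothetical field for the image location
    description: str                    # free text; may be inaccurate
    attribution: Optional[str] = None   # source and author, when available


def caption(image: ImageSuggestion) -> str:
    """Build a caption, crediting the source only when attribution is present."""
    if image.attribution:
        return f"{image.description} ({image.attribution})"
    return image.description


credited = ImageSuggestion(
    url="http://images.example.org/bridge.jpg",
    description="Bridge at night",
    attribution="Photo by A. Author via Flickr",
)
uncredited = ImageSuggestion(
    url="http://images.example.org/river.jpg",
    description="River bank",
)
```

Since the description may be wrong, it is best treated as a default that the
author can edit before publishing.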
3.2.2 Related articles
Zemanta searches automatically for related articles and gives the author full control
over their inclusion. Zemanta aggregates articles from many different internet
sources, such as the major news sources (e.g. BBC and CNN) and over 10,000 blogs.
Judging from customers’ feedback, Zemanta has learned that many authors
only read the suggested related articles themselves and use the information gained to
write better content, instead of explicitly linking their work to the suggested resources.
Although this is not the intended use of Zemanta, it has been accepted
by the Zemanta community.
3.2.3 In-text links
In-text links are links inside the main body of text that lead the reader to
information about specific concepts and topics that are directly mentioned. In
order to establish connections between specific concepts or topics and the
input text, Zemanta uses knowledge databases such as Wikipedia, IMDB, Rotten
Tomatoes, Amazon book listings and similar sources. Links are not anchored to a
specific location in the text, but to substrings of the text. This is done because the
original text might change before the author decides to apply a link, and it would be
extremely hard for the authoring software to keep track of positions. That is why the
“anchor” attribute defines the substring to which the link should be attached.
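Anchoring a link to a substring rather than to a position can be sketched as
follows. This is a minimal illustration, not the behaviour of any particular
plug-in: the function name, the HTML output, and the choice to link only the first
occurrence are assumptions made for the example.

```python
import html


def apply_link(text: str, anchor: str, url: str) -> str:
    """Wrap the first occurrence of `anchor` in `text` with an HTML link.

    The text is returned unchanged if the author has edited the anchor away
    since the suggestion was made.
    """
    position = text.find(anchor)
    if position == -1:
        return text
    end = position + len(anchor)
    link = f'<a href="{html.escape(url, quote=True)}">{anchor}</a>'
    return text[:position] + link + text[end:]


draft = "Ljubljana is the capital of Slovenia."
linked = apply_link(draft, "Slovenia", "http://en.wikipedia.org/wiki/Slovenia")

# The author rewrote the sentence before applying the link: nothing to attach to.
edited = "Ljubljana is a lovely city."
unchanged = apply_link(edited, "Slovenia", "http://en.wikipedia.org/wiki/Slovenia")
```

Because the anchor is looked up at the moment the link is applied, insertions or
deletions elsewhere in the draft do not invalidate the suggestion.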
3.2.4 Tags
A tag is a relevant keyword or term associated with a piece of content. Labeling by
keywords has been used for a long time in scientific publications, but recently many
web services have embraced tagging, because it is a powerful way of
describing information with metadata. Even when lacking formal structure, tags can
provide valuable navigational enhancements and make the task of search engines
easier. However, it is still a problem when tagging is not done in a standardized way
(as discussed in the previous sections). Zemanta offers both tags
based on words and phrases that can be found in the author's text and tags for topics
that could represent the content as a whole, but are not explicitly mentioned.
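The distinction between the two kinds of tags can be made concrete with a small
sketch. It only separates already suggested tags by whether they occur in the
text; it does not generate tags, and the function name and sample data are
assumptions made for the example.

```python
def split_tags(text, tags):
    """Separate tags that appear in the text from tags that do not."""
    lowered = text.lower()
    explicit = [tag for tag in tags if tag.lower() in lowered]
    implicit = [tag for tag in tags if tag.lower() not in lowered]
    return explicit, implicit


post = "Zemanta suggests links and images while you write a blog post."
explicit, implicit = split_tags(post, ["Zemanta", "blog", "Semantic Web"])
```

Here "Semantic Web" stands for a topic tag that describes the post as a whole
without being mentioned in it.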
4. Test results
I have used Zemanta on several platforms, such as WordPress for writing blog posts and
Google Mail for composing e-mails, and it has proven very useful. But the
longer I enjoyed the benefits of this tool, the more I wanted to know what is
behind it. That is when I discovered LinkedGalaxy, an application built upon the Zemanta
API, DBpedia and Freebase, which allows you to visualize the input text as a graph.
Its nodes are the semantic entities corresponding to the concepts in the text. Figs. 4 and
5 show screenshots of the application's output for an English text.
Fig. 4. The semantic concepts graph of the English text
Fig. 5. The complete graph of the English text
I tried the same application on a text written in Romanian that discusses
world-known entities, and got a very poor output. The resulting graph
for the Romanian text can be seen in Fig. 6. The result was predictable, because I knew
that Zemanta does not support multilingualism for the moment. However, the engine
did manage to relate some of the words to concepts, which is a step
towards further improvements that could include internationalization.
Fig. 6. The semantic concepts graph of the Romanian text