
Zemanta


Zemanta - a content recommendation tool for bloggers: basic principles and features, architecture and test cases.

Published in: Education, Technology

Zemanta - Blog me up!

Magdalena Jitca
Faculty of Computer Science, Iasi, Romania
magdalena.jitca@info.uaic.ro

Abstract. There is a growing need to help authors publish their content online, but there are still unsolved issues regarding the enrichment of the content and making it more readable, discoverable, and interconnected. Zemanta is a tool for content understanding and recommendation in real time, widely used by bloggers because of its relevant suggestions, which make blogging more fun. According to the company's estimates, it takes only 30 seconds to have an article published with Zemanta's assistance.

Keywords: Zemanta, content recommendation, social web software

1. Introduction

Making your work known and visible to the world is, most of the time, a long-term process, given the long-winded procedures behind it. It is not just a matter of content, but of how you present it and relate it to past work in the field. Finding appropriate references and definitions, as well as suggestive images and tags to describe your article, is time-consuming. Besides, this usually requires skills and in-depth knowledge of specifications and standards. What if there were an automatic way of dealing with all these activities, leaving you more time for your research and the creative process?

The solution comes from Zemanta Ltd., which provides a free tool for content understanding and recommendation in real time (while you are writing your post). It performs a semantic analysis of the input text and provides related content, pictures, links, and tags as output. From there, it is the author's job to choose what is needed from the content recommended by the Zemanta engine. The result is not only an attractive, user-friendly design, but also an efficient connection between your content and external data that computers can understand.

The target group is composed of social web software users (e.g. blog networks, blog farms), professional publishers, and people with programming skills. The rest of this article focuses on bloggers and on the different types of interaction that Zemanta offers. Zemanta is available for download in several formats from its website [1]: as a browser extension (for Firefox 2 and 3, Internet Explorer 7 and 8, Chrome, Safari etc.), as a server-side plug-in (for WordPress, Blogger, Joomla etc.), and as an API for developers.
2. Basic principles and features

1. Recommendation of related content

Zemanta searches the content pool for related articles, links, and images, and proposes the most relevant ones to the author. They show up in a special sidebar which keeps updating itself while you edit your manuscript. The recommendations are more accurate for entries longer than 300 words. It is up to you to decide afterwards whether the suggestions are the expected ones and how correct they are. Zemanta is English-only for now, but I have obtained good results when writing in another language if the content involves trademarked items or well-known unique entities (personalities, places, companies etc.). The comparison is presented in section 4.

2. Large-scale knowledge database

Zemanta draws on more than 10,000 news sources, connecting 100 million content objects. The articles it suggests come from hundreds of top media sources on the web, as well as from the social networks and other blogs of Zemanta users. The suggested images come from Wikimedia Commons, Flickr, and stock photo providers like Shutterstock and Fotolia. Zemanta, like many other applications, uses Wikipedia as a kind of expert system. For example, if a page is linked from a Wikipedia article, it can be assumed to be relevant to the topic of that article. That kind of approach can be used for many different tasks, all with the goal of making the web and web services smarter.

3. Auto-tagging

Tagging is not an easy task, but doing it right helps the web grow smarter by marking up pages, posts, videos, images, and other objects available on the web. Zemanta can automatically tag the content into general categories, among which you can choose the ones you need. However, humans still seem to be more efficient than computers at tagging, which is why Zemanta lets its users customize the categorization. Besides making tagging suitable for your own needs, the choices you make are further used by Zemanta to refine the recommendation results.

4. Uploading custom content

A client can help improve the content pool by making use of his own experience and previous work. He can upload RSS feeds, which are later indexed by Zemanta Enterprise and thus included in the content pool. The only condition is that the content must be either original or properly licensed.

5. Customizing the recommendation pool

It is also possible to select the content to be included in the recommendations you get from Zemanta (e.g. limiting links to your own network and trusted sources). Because it is so well integrated into the blogging platforms, Zemanta offers personalization targeted at bloggers. They can define their own blog sources to draw from and import their Twitter/Facebook/MyBlogLog contacts for automatic recognition while writing.

6. Copyright filtering

Zemanta also pays close attention to copyright legislation, making sure that suggested content is licensed under Creative Commons or approved by third parties, so that the user does not run into problems by using Zemanta's service. Images are the main case that requires attention, because tags, for example, are generally not regarded as creative work and are therefore not protected by copyright.

7. Re-blogging

Zemanta offers a special feature of cross-platform quoting for blogs, using different techniques to obtain the raw body of the post intended for quoting. One of them is via the HTTP referrer and the second is via a “request id” that is passed as part of the URL. For example, you can have your finished article submitted to a blog of your choice (by supplying the username and the password).

3. Architectural details

3.1 Zemanta system architecture

From the architectural point of view, the Zemanta web service is a server that stores the content received from the application and, when requested, sends suggestions of related content. This communication is based on the HTTP protocol and makes use of the standard JSON and XML response formats. Authoring applications such as content management systems then present the suggested content to the author, so he can select the appropriate information to merge into his manuscript. Fig. 1 [2] depicts the flow of the manuscript authoring process and shows clearly how the Zemanta server works. It is important to mention that the Zemanta service is no longer involved after the authored work has been posted, for example when the content is being read by other users.
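The request and response exchange described in section 3.1 can be sketched as follows. This is a minimal sketch, not the documented Zemanta API: the endpoint URL, the parameter names (`api_key`, `text`, `format`), and the shape of the sample response are assumptions made for illustration. Only the use of HTTP and of JSON or XML responses comes from the description above. The request is built but never sent, so the example runs offline.

```python
import json
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration (not a real Zemanta URL).
ENDPOINT = "https://api.example.com/suggest"


def build_suggest_request(text: str, api_key: str, fmt: str = "json") -> Request:
    """Build (but do not send) an HTTP POST carrying the author's draft text."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    # Parameter names are assumptions, not taken from the real service.
    body = urlencode({"api_key": api_key, "text": text, "format": fmt})
    return Request(
        ENDPOINT,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_suggestions(raw: str) -> dict:
    """Group a JSON response by suggestion type; missing types become empty lists."""
    payload = json.loads(raw)
    kinds = ("images", "articles", "links", "tags")
    return {kind: payload.get(kind, []) for kind in kinds}


# A made-up response body standing in for what a server might return.
SAMPLE_RESPONSE = json.dumps(
    {
        "articles": [
            {"title": "An example related post", "url": "https://example.org/post"}
        ],
        "tags": ["blogging", "semantic web"],
    }
)

request = build_suggest_request("Draft post about Iasi.", api_key="demo-key")
suggestions = parse_suggestions(SAMPLE_RESPONSE)
print(request.get_method(), request.full_url)
print(suggestions["tags"])
```

An authoring application would send such a request each time the draft changes and render the grouped suggestions in the sidebar. After the post is published, no further requests are needed.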
Fig. 1. The flow of the authoring process with Zemanta

The entities involved in this process can be split into 5 categories, although a single person may perform the actions corresponding to several entities. The roles they play and the way they interact are depicted in Fig. 2 [2] and described below.

Fig. 2. The distribution of the roles in the Zemanta system

• Author - the person who composes the content and improves it with Zemanta's suggestions
• CMS creator - the person or organization developing content management software that integrates the services and experience offered by Zemanta
• Platform owner - the person or organization owning the specific hosting platform on which the CMS software runs
• Reader - a person or organization benefiting from the content
• Zemanta - the service provider during the content creation process

An example where roles overlap is the case of applications developed on top of the Zemanta API. If a programmer works as a software developer (for the CMS), but also runs the application and then creates content with it, he is playing three different roles. This is the case of enthusiast developers experimenting with Zemanta, whose results can be seen and tested in [3].

Let us take a step back to understand how Zemanta's content recommendation engine works. Instead of running keyword-based queries (as traditional search engines do), it analyzes the whole text by means of different natural language processing techniques to build a deeper understanding of the content. On this basis it identifies the concepts in the text by connecting them to a semantic database (e.g. DBpedia, MusicBrainz) and delivers the suggested related results. In section 4, an interesting application based on Zemanta, DBpedia, and Freebase is presented, in order to give a view of the internal representation of the content recommendation engine. It is this unique combination of NLP, machine learning, and fine-tuning that makes it work so well.

3.2 Suggestions in detail

As previously stated, Zemanta provides four types of content recommendations, which are discussed in this section. They are all shown in Fig. 3, which is a screenshot taken while composing a blog entry.
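To make the idea of connecting text to concepts in a semantic database (section 3.1) more concrete, here is a deliberately naive sketch. It is not Zemanta's algorithm, which relies on NLP and machine learning. It only shows the output of such a step: surface forms in the text mapped to concept identifiers. The tiny knowledge base and its `example.org` URIs are hypothetical.

```python
import re

# Hypothetical miniature knowledge base: surface form -> concept URI.
# A real system would query a semantic database such as DBpedia instead.
KNOWLEDGE_BASE = {
    "zemanta": "http://example.org/concept/zemanta",
    "creative commons": "http://example.org/concept/creative_commons",
    "flickr": "http://example.org/concept/flickr",
}


def link_concepts(text, kb=KNOWLEDGE_BASE):
    """Return (surface form, concept URI) pairs in order of first appearance."""
    found = []
    for surface, uri in kb.items():
        # Whole-word, case-insensitive match of the known surface form.
        pattern = r"\b" + re.escape(surface) + r"\b"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            found.append((match.start(), match.group(0), uri))
    # Sort by position in the text, then drop the position.
    return [(surface, uri) for _, surface, uri in sorted(found)]


concepts = link_concepts("Zemanta suggests Creative Commons images.")
for surface, uri in concepts:
    print(surface, "->", uri)
```

A plain keyword query would treat "Creative" and "Commons" as two unrelated words. Linking the phrase to a single concept is what allows related articles, images, and tags to be retrieved for the entity itself.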
3.2.1 Images

Zemanta uses multiple sources as a basis for image suggestions. Among the most widely used are Wikipedia, Getty, and Flickr, but stock image providers are also a good choice, as they provide images of higher aesthetic quality. Because Flickr images are retrieved through the Flickr API, Zemanta cannot take advantage of its internal concept representation for them. This means that less topically accurate images may be suggested even when better ones exist. Each image suggestion includes a “description” attribute. This is only a textual description of what the image represents, and it may be inaccurate in some cases, typically because the image was tagged poorly or incorrectly. The image also includes an “attribution” attribute, which describes the source and the author of the image when those are available.

3.2.2 Related articles

Zemanta searches for related articles automatically and gives the author full control over their inclusion. It aggregates articles from many different internet sources, such as the major news sources (e.g. BBC and CNN) and over 10,000 blogs. From customer feedback, Zemanta has learned that many authors only read the suggested related articles themselves and use the information gained to write better content, instead of explicitly linking their work to the suggested resources. Although this is not the intended use of Zemanta, it has been accepted by the Zemanta community.

3.2.3 In-text links

In-text links are links inside the main body of text that lead the reader to information about specific concepts and topics mentioned directly. In order to establish connections between specific concepts or topics and the input text, Zemanta uses knowledge databases such as Wikipedia, IMDB, Rotten Tomatoes, Amazon book listings, and similar sources. Links are not anchored to a specific location in the text, but to substrings of the text. This is done because the original text might change before the author decides to apply a link, and it would be extremely hard for the authoring software to store the bookmarks. That is why the “anchor” attribute defines the substring to which the link should be attached.

3.2.4 Tags

A tag is a relevant keyword or term associated with a specific piece of content. Labeling by keywords has long been used in scientific publications, but recently many web services have embraced tagging, because it is a powerful way of describing information with metadata. Even when lacking formal structure, tags can provide valuable navigational enhancements and make the task of search engines easier. However, it is still a problem when tagging is not done in a standardized way (as discussed in the previous sections). Zemanta offers both tags based on words and phrases found in the author's text and tags for topics that represent the content as a whole but are not explicitly mentioned.
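The substring anchoring described in section 3.2.3 can be sketched as follows. This is a sketch under stated assumptions, not Zemanta's implementation: the `anchor` attribute comes from the text, while the `target` field name, the HTML output, and the first-occurrence rule are hypothetical choices made for illustration.

```python
import html


def apply_links(text, links):
    """Attach each link to the first occurrence of its anchor substring.

    Returns the linked text and the list of anchors that could not be applied.
    """
    spans = []
    skipped = []
    for link in links:
        anchor, target = link["anchor"], link["target"]
        start = text.find(anchor)
        if start == -1:
            # The author removed or rewrote the phrase: nothing to attach to.
            skipped.append(anchor)
            continue
        spans.append((start, start + len(anchor), anchor, target))

    # Apply from right to left so that earlier offsets stay valid.
    result = text
    last_start = len(text)
    for start, end, anchor, target in sorted(spans, reverse=True):
        if end > last_start:
            # Overlaps a span that is already linked; leave it out.
            skipped.append(anchor)
            continue
        markup = '<a href="{}">{}</a>'.format(html.escape(target), anchor)
        result = result[:start] + markup + result[end:]
        last_start = start
    return result, skipped


text = "Zemanta uses Wikipedia as a knowledge source."
links = [
    {"anchor": "Wikipedia", "target": "https://en.wikipedia.org/wiki/Wikipedia"},
    {"anchor": "Flickr", "target": "https://www.flickr.com/"},
]
linked, skipped = apply_links(text, links)
print(linked)
print(skipped)

# The anchor is still found after the author edits the text around it.
edited_linked, _ = apply_links("Today, " + text, links)
print(edited_linked)
```

Because the link is looked up by substring at the moment it is applied, the authoring software does not have to track character offsets while the author keeps editing. The cost is that a link is silently lost if its anchor phrase no longer appears in the text.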
4. Test results

I have used Zemanta on several platforms, such as WordPress for writing blogs and Google Mail for composing e-mails, and it has proven to be very useful. But the longer I enjoyed the benefits of this tool, the more I wanted to know what is behind it. That is when I discovered LinkedGalaxy, an application built upon the Zemanta API, DBpedia, and Freebase, which allows you to visualize the input text as a graph. Its nodes are the semantic entities corresponding to the concepts in the text. Fig. 4 and 5 are screenshots of the application's output for the English text [4].

Fig. 3. The semantic concepts graph of the English text

Fig. 4. The complete graph of the English text

I have tried the same application on a text written in Romanian that discusses world-known entities, and I got a very poor output. The resulting graphs for the Romanian text can be seen in Fig. 6. The result was predictable, because I knew that Zemanta does not support multiple languages for the moment. However, the engine did manage to relate some of the words to concepts, which is a step towards further improvements that could include internationalization.
Fig. 5. The semantic concepts graph of the Romanian text

References

1. http://www.zemanta.com
2. http://developer.zemanta.com/docs/Zemanta_API_companion
3. http://developer.zemanta.com/showcase
4. http://www.time.com/time/specials/packages/article/0,28804,1937994_1938235,00.html?cnn=yes
