
Open Calais


This is a short study about OpenCalais and linked data

Published in: Education, Technology


  • 1. Study about OpenCalais API practical usage in linked data context
    Căciulă Maricel, Faculty of Computer Science, A. I. Cuza University of Iasi
    Abstract. This paper presents OpenCalais. It gives a short description of the web service API, presents some projects that use this API, and closes with some personal ideas for using the API.
    Keywords: Web Service, API, resource management, linked data.
  • 2. 1 Introduction
    OpenCalais is a project that makes your text more valuable. It identifies named entities, facts and events in a text and returns the result in Resource Description Framework (RDF) format. The project was initiated by Thomson Reuters, and at the beginning it aimed to eliminate the manual tagging step for publishers. In time, OpenCalais proved useful for improving the user search experience, and lately it has been used to generate content hubs. OpenCalais is free to use and can be accessed up to 40,000 times per day. It can be used in commercial and noncommercial applications. The motivation behind the free usage is to improve Thomson Reuters' natural language processing tools and to semantically link web content.
    1.1 OpenCalais Web Service
    OpenCalais can be accessed through a web service. It supports SOAP, REST and HTTP traffic compression. Accessing through SOAP can be done using the web method “Enlighten” on this URL :
    String Enlighten(String licenseID, String content, String paramsXML)
    The parameters are described below, following the official Calais SOAP documentation (field name, type, definition, notes):
    - licenseID (String): API access key. Notes: obtain through registration.
    - content (String): Content to be annotated. Notes: max input length is 100,000 characters.
    - paramsXML (String): Processing and user directives and external metadata. Notes: max parameters length is 16,000 characters.
    Accessing through REST can be done at the following URL :
  • 3. licenseID=url-encoded-string&content=url-encoded-string&paramsXML=url-encoded-string
    This argument line can be used with GET, by appending it to the REST URL, or with POST, by including it in the HTTP request body. There is a nice tutorial example on the official site: http: //opencalais/files/
    Accessing through HTTP traffic compression can be done using a gzip request. The client should add the “Accept-Encoding: gzip” header to the web request in order to tell the server that it can handle a gzip response.
    1.2 Open Calais API
    The input API parameters are set in XML format. The parameters refer to processing directives, user directives and external metadata. The entire input XML (meaning paramsXML) must be HTTP encoded (escaped). The API input parameters are described below, following the official OpenCalais API documentation (parameter, section, definition, values, default):
    - contentType (Processing Directives): Format of the input content. Values: “TEXT/HTML”, “TEXT/XML”, “TEXT/HTMLRAW”, “TEXT/RAW”. Default: None.
    - outputFormat (Processing Directives): Format of the returned results. Values: “XML/RDF”, “Text/Simple”, “Text/Microformats”, “Application/JSON”. Default: XML/RDF.
    - reltagBaseURL (Processing Directives): Base URL to be put in Rel-tag microformats. Values: <the base URL>, for example “”. Default: None.
  • 4. - calculateRelativeScore (Processing Directives): Indicates whether the extracted metadata will include a relevance score for each unique entity. Values: “true” or “false”. Default: True.
    - enableMetadataType (Processing Directives): Indicates whether the output will include Generic Relation extraction (RDF) and/or SocialTags. Values: “GenericRelations”, “SocialTags”, “GenericRelations,SocialTags”. Default: None.
    - docRDFaccessible (Processing Directives): Indicates whether the entire XML/RDF document is saved in the Calais Linked Data repository. Values: “true” or “false”. Default: None.
    - allowDistribution (User Directives): Indicates whether the extracted metadata can be distributed. Values: “true” or “false”. Default: False.
    - allowSearch (User Directives): Indicates whether future searches can be performed on the extracted metadata. Values: “true” or “false”. Default: False.
    - externalID (User Directives): User-generated ID for the submission. Values: any string. Default: None.
    - submitter (User Directives): Identifier for the content submitter. Values: any string. Default: None.
    The input content can be TEXT/HTML, TEXT/HTMLRAW, TEXT/XML or TEXT/RAW. If the content type is not specified, Calais tries to detect the type automatically. OpenCalais uses English as its default language, but it also supports French and Spanish. If the input text is shorter than 100 characters, the default language is used.
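    The REST access from section 1.1 and the paramsXML input from section 1.2 can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the official client: the endpoint URL is a placeholder (the real one is not reproduced in this text), and the layout of the paramsXML document (the element names, the c: prefix and its namespace) is an assumption that should be checked against the official documentation. The sketch builds the request but does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request
from xml.sax.saxutils import quoteattr

# Placeholder endpoint (assumption): use the REST URL from the official documentation.
CALAIS_REST_URL = "https://calais.example.invalid/rest"


def build_params_xml(content_type="TEXT/RAW", output_format="XML/RDF",
                     allow_distribution=False, allow_search=False,
                     external_id="", submitter=""):
    """Return a paramsXML string.

    The parameter names come from the table above; the XML element names
    and the namespace are assumptions made for this sketch.
    """
    def as_flag(value):
        return "true" if value else "false"

    return (
        '<c:params xmlns:c="http://s.opencalais.com/1/pred/">'
        "<c:processingDirectives"
        f" c:contentType={quoteattr(content_type)}"
        f" c:outputFormat={quoteattr(output_format)}/>"
        "<c:userDirectives"
        f" c:allowDistribution={quoteattr(as_flag(allow_distribution))}"
        f" c:allowSearch={quoteattr(as_flag(allow_search))}"
        f" c:externalID={quoteattr(external_id)}"
        f" c:submitter={quoteattr(submitter)}/>"
        "<c:externalMetadata/>"
        "</c:params>"
    )


def build_request(license_id, content, params_xml, url=CALAIS_REST_URL):
    """Build (but do not send) a POST request carrying the three arguments."""
    # urlencode escapes every value, which covers the requirement that
    # paramsXML must be HTTP encoded.
    body = urlencode({
        "licenseID": license_id,
        "content": content,
        "paramsXML": params_xml,
    }).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Tells the server that the client can handle a gzip response.
        "Accept-Encoding": "gzip",
    }
    return Request(url, data=body, headers=headers, method="POST")
```

    Calling build_request("my-key", text, build_params_xml()) gives a request object that could then be passed to urllib.request.urlopen once a real endpoint and license key are available. A client that sends the gzip header must also be prepared to decompress the response, for example with the gzip module.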
  • 5. The API can also be used with SSL, by accessing it through HTTPS.
    1.3 Data structure
    OpenCalais returns the response in RDF format by default. The RDF header includes a summary of all entities extracted from the text, sorted alphabetically by entity type.
    INFORMATION
    For each unique element, the information includes the element type (for example Company, Person or Acquisition), the attribute values and the ID of the unique element. If the Relevance feature is enabled, the resulting RDF also includes the relevance score for each unique entity. When an attribute value is referred to by its ID, a comment containing the actual value is included for easier understanding.
    INSTANCES
    According to the official documentation, the response contains one or more individual instances (mentions) for each unique metadata element. Each element instance includes the following:
    - c:docId: URI of the document this mention was detected in
    - c:subject: URI of the unique entity
    - c:detection: snippet of the input content where the metadata element was identified
    - c:prefix: snippet of the input content that precedes the current instance
    - c:exact: snippet of the input content in the matched portion of text
    - c:offset: the character offset relative to the input content after it has been converted into XML
    - c:length: length of the instance
    1.4 OpenCalais and linked data
    With the last significant update of OpenCalais, version 4.0, users are now able to connect to the Linked Data web standard. Linked data is a method of exposing, sharing and connecting data through dereferenceable URIs on the web. To be compatible, OpenCalais respects the four principles of linked data.
    - It has URIs to identify things.
  • 6. - It uses HTTP URIs so that these things can be referred to and looked up by people and user agents.
    - It provides useful information (structured description, metadata) about the thing when its URI is dereferenced.
    - It includes links to other URIs in the exposed data to improve discovery of other related information on the Web.
    The image shown beneath presents the latest linkage within the Linking Open Data datasets, in which OpenCalais appears. The Calais ecosystem is exposed via Linked Data endpoints, and when it extracts an entity from a given text it also returns an entity URI. This URI is dereferenceable: you can submit an HTTP request programmatically or through a browser and get in response useful information and links to other Linked Data and web assets. According to the official site, OpenCalais is currently linked to the following assets:
    - DBpedia
    - Wikipedia
    - Freebase
    - GeoNames
    - IMDB
    - LinkedMDB
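    Dereferencing an entity URI programmatically, as described above, can be sketched in Python like this. The entity URI in the usage note is a made-up placeholder, and asking for RDF through the Accept header is an assumption about how the endpoint negotiates content; the text above only states that an HTTP request on the URI returns useful information. The sketch builds the request without sending it.

```python
from urllib.request import Request


def build_dereference_request(entity_uri, accept="application/rdf+xml"):
    """Build (but do not send) a GET request for an entity URI.

    The Accept header asks for a machine-readable description; whether
    the server honours it is an assumption of this sketch.
    """
    if not entity_uri.startswith(("http://", "https://")):
        raise ValueError("linked data URIs are expected to be HTTP URIs")
    return Request(entity_uri, headers={"Accept": accept})
```

    For example, build_dereference_request("http://example.org/entity/123") returns a request that could be passed to urllib.request.urlopen; a browser visiting the same URI would typically ask for HTML instead.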
  • 7. 2 Practical usage
    OpenCalais is used primarily for tagging blogs and WordPress articles. As its founder says, the team noticed that OpenCalais is also used for other purposes, such as creating content hubs. OpenCalais can be used for:
    - Triage: filter a large influx of content
    - Workflow: use the metadata returned by OpenCalais to route documents to the right person or system
    - Content Enhancement: OpenCalais can be the entry point to the huge world of linked data
    - Alerting: allow advanced alerting, giving users the ability to interact more naturally with the user application
    - Media Monitoring: a content feed (social media, press releases, news) can be taken in, categorized and organized using OpenCalais
    - Content Harmonization: mix different sources of information so that they can be integrated in a CMS (Content Management System)
    - Automated News Portal: publish relevant information taken from different sources after it is filtered using OpenCalais
    - SEO: improve search
    - News Presentation: with consistent metadata extraction it is possible to create new navigation and search tools on your site
    2.1 Blog tagging
    As expected, one of the first implementations based on OpenCalais was designed for bloggers. Tagaroo is a tool initiated by the same OpenCalais team, and it is a plugin for blog sites. This tool makes your blog better by improving
  • 8. both the user experience and searchability. The tool analyzes your text as you are writing and suggests intelligent tags for the things and events you are writing about. A nice feature is that the generated tags can be used to automatically get images from Flickr to include in your post. Link:
    Another site that uses OpenCalais for blog tagging is Al Jazeera English's new blogging network. All posts in the new blog are semantically tagged using OpenCalais for optimal search and navigation. Link: http://blogs.aljazeera.net
    I *heart* Sea is a hyperlocal news aggregation site that collects some of the best blogs in Seattle. It uses OpenCalais to automatically tag the keywords of the blog posts it aggregates, to make it easier to find related information. Link:
    2.2 Press tagging
    The new website of The New Republic uses an OpenCalais-enabled, Drupal-powered Content Management System to increase editorial productivity and improve search engine optimization. Link:
    Slate Magazine's News Dots network visualizes the most recent topics in the news as a concise network of related topics. Like a human social network, the news tends to cluster around popular topics, and most stories are more closely related than one might think. In the background, News Dots scans all the articles from major publications and submits them to OpenCalais to identify the relevant people, places, companies, topics, etc. Link:
    2.3 Media monitoring
    Tattler is an open source topic monitoring tool for the Web. Tattler finds and aggregates content from the web on topics users ask it to monitor. In the background it uses OpenCalais together with other Semantic Web technologies. It mines news, websites, blogs, multimedia sites and other social media like Twitter, to find
  • 9. mentions of the issues most relevant to the user's selected topics, making it easy for the user to filter, organize, share and take action on content gathered from the Web. Link:
    Interceder is a social media monitoring tool that makes it easy to track trending topics and search through the latest content from major news websites, blogs, Twitter and YouTube. Link:
    AskJot is a tool for analyzing web pages for keywords and displaying them as links to search results from various services around the web. Behind the scenes, AskJot uses OpenCalais, the NYT article search API, DBpedia, the Yahoo! Answers API, the Flickr API and others.
    2.4 Intelligent Content
    Feedly is a Firefox plugin that brings to life user-selected inputs from Google Reader, FriendFeed, Twitter, RSS feeds and others in an easy-to-read and engaging magazine-style format. It uses OpenCalais and other semantic technologies for clustering, linking and organizing the content in an intuitive fashion that is nicely integrated into the browsing experience. Link:
    OpenPublish is based on the Drupal platform and is a next-generation CMS tailored to the needs of today's online publishers (magazines, newspapers, journals, trade publications, broadcast and wire services). It uses metatagging from OpenCalais to streamline content operations, automatically create topic hubs, and recommend related articles and more archived stories from the same authors. Link:
    DocumentCloud was founded by The New York Times and ProPublica. DocumentCloud is a unique online resource that offers public access to news reporters' original source materials, including documents, media files and more. OpenCalais processes the materials available through DocumentCloud to make it easy for users to explore connections between newsmakers, corporations, transactions and even quotations across documents and across the full collection of sources. Link:
  • 10. 3 Personal ideas for OpenCalais API usage
    When I tested the OpenCalais API with Document Viewer, I noticed that on short texts the relevance is not accurate. Text taken from Twitter, for example, shows this: OpenCalais adds irrelevant topics and social tags. When I tested OpenCalais on a large text, it took several hours to process, which is not acceptable. This means it is not reliable for books and other long texts. I finally concluded that the optimal text is an article or blog post that has more than 100 words and is shorter than 2 pages.
    3.1 Blog tagging and filter
    Manual tagging of blogs is not always the best way to describe a post. I have seen blog posts that are not completely described, omitting key words that could be essential for readers to find what they are interested in. Also, because people see things in a personal way, they can tag the same post with different key words. Essentially, the idea is to tag blogs using the semantic web (OpenCalais), covering as many blogs as possible, either added manually or found through a crawler that recursively adds new blogs (using the contacts of friends or of people who added a comment). The easiest way to watch blogs is through their RSS feeds. This way we can gather posts from different sites in a standard way. By creating a new service that gathers posts from blogs and tags the text using OpenCalais, we can build a database of feeds. With such a database we can build a site or application that lets a user create a custom generated RSS feed from the entire database. This way a user can see the posts he is interested in from thousands of blogs. The generated RSS feed can be consumed by existing RSS applications. The original part of this idea is that you could see posts from thousands of common blogs and filter them by semantic tags.
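    The last step of this idea, generating a custom RSS feed from already tagged posts, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the Post class and both function names are hypothetical, the tags are assumed to have been produced earlier by a tagging step such as OpenCalais, and gathering the posts and storing them in a database are left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Post:
    """A blog post with the semantic tags attached to it (hypothetical model)."""
    title: str
    link: str
    tags: set = field(default_factory=set)


def filter_posts(posts, wanted_tags):
    """Keep the posts that carry at least one wanted tag (case-insensitive)."""
    wanted = {tag.lower() for tag in wanted_tags}
    return [p for p in posts if wanted & {tag.lower() for tag in p.tags}]


def build_rss(feed_title, feed_link, posts):
    """Serialize posts as a simple RSS 2.0 document, one category per tag."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Posts filtered by semantic tags"
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post.title
        ET.SubElement(item, "link").text = post.link
        for tag in sorted(post.tags):
            ET.SubElement(item, "category").text = tag
    return ET.tostring(rss, encoding="unicode")
```

    A user interested in a topic would call build_rss on the result of filter_posts for that topic, and any existing RSS reader could consume the returned document. Matching tags case-insensitively is a design choice of this sketch; a real service might instead match on the entity URIs returned by OpenCalais, so that different spellings of the same entity fall together.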
  • 11. 3.2 Language abstraction
    Another interesting idea is to abstract the language. Right now the languages supported by OpenCalais are English, French and Spanish. It would be interesting to semantically tag texts in Romanian or in other languages. I was thinking of integrating the Google Translate service with OpenCalais. The idea is to first translate the text from blogs or news and then use OpenCalais to semantically tag it. This may not be fully reliable, because the translation may not be accurate, but for texts longer than 100 words it will probably produce the most relevant tags correctly, because the translation is likely to handle the key words well.
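    The translate-then-tag idea can be sketched in Python as a small pipeline. This is only a sketch under stated assumptions: the function name is hypothetical, and the translation and tagging steps are passed in as plain functions, so no real Google Translate or OpenCalais call is made here. The 100-word threshold mirrors the estimate given above and is not a documented limit.

```python
from typing import Callable, Iterable, List


def tag_foreign_text(text: str,
                     translate: Callable[[str], str],
                     tag: Callable[[str], Iterable[str]],
                     min_words: int = 100) -> List[str]:
    """Translate a text into a supported language, then tag the translation.

    Texts shorter than min_words are skipped, because tags derived from a
    short, possibly inaccurate translation are unlikely to be reliable.
    """
    if len(text.split()) < min_words:
        return []
    translated = translate(text)
    return list(tag(translated))
```

    In a real setup, translate would wrap a translation service and tag would wrap a call to OpenCalais that returns the extracted tags. Keeping both as parameters makes it possible to test the pipeline with simple stand-in functions.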