
Simple OpenCalais Whitepaper



Here is a simple, high-level whitepaper on the OpenCalais service. Feel free to ping us with questions on Twitter @OpenCalais or @KristaThomas.

Published in: Technology


Thomson Reuters Calais Web Service & the Linked Content Economy

Executive Summary:
The rise of the Internet has brought dramatic change to the publishing industry. While newspapers in particular struggle to adapt, advertisers are cutting budgets, seeking new efficiencies and increasingly using the Web to go straight to the consumer. Semantic technologies and new open data resources on the Web give both publishers and advertisers new tools and services that can help them succeed. The Thomson Reuters Calais Web service, found at, is one such service.

Calais identifies and automatically tags the people, places, companies, facts and events in text. It then forges connections between those entities and relevant data sets, media files, Wikipedia entries and more on the open Web. Finally, it gives publishers a new way to share that tagged content with next-generation search engines, news aggregators and others in the content ecosystem.

“Calais turns static text into ‘Smart Media’ that is enriched with open data and connected to a dynamic ‘Linked Content Economy’.”
-Thomas (Tom) Tague, Calais initiative lead

Armed with this powerful new tool, forward-looking publishers are automating time-consuming content operations and increasing editorial productivity. They are also enhancing the value of their content, improving their user experience and preparing to reach more readers in tomorrow’s media landscape – increasingly called the ‘linked content economy.’

Background:
Calais is a strategic initiative at Thomson Reuters to advance the interoperability of content and support the company’s mission to provide pervasive intelligent information to its customers. Calais uses Natural Language Processing to give publishers free metatagging services, developer tools and an open standard for the generation of semantic content. The latest update to Calais – Calais 4.0 – is a significant advance on the initiative’s goals.
The Calais team originally set out to help developers, bloggers and publishers automatically tag their content to improve search and navigation, and enable new reader engagement features. With Calais 4.0, the Calais Web service goes beyond metatagging to help publishers enhance their content, using open data from sources like Wikipedia, DBpedia, GeoNames, the Internet Movie Database (IMDB), and more. It also makes it easy for publishers to use their metadata to share their content with next-generation content consumers – such as search engines, news aggregators, ‘related stories’ services and more – to ultimately reach more readers. With these added capabilities, Calais helps content creators and content consumers alike connect to the rapidly emerging ‘Linked Content Economy’ and deliver ‘Smart Media.’

The Linked Content Economy & Smart Media:
The Linked Content Economy is an evolving ecosystem of enriched and connected content that helps publishers engage readers, improve the user experience, and – ultimately – better convert readership to revenue. Linked Content goes beyond ‘link journalism’ (linking to related stories, etc.). It uses metadata to help publishers create “Smart Media” – content that automatically connects the concepts, people, companies, etc. it contains to a rich array of related data sets and media assets on the Web. It then uses metadata to help publishers share their “Smart Media” with the rest of the content ecosystem, including search engines, news aggregators, ‘related stories’ applications and more.

How it Works:
1. Publishers submit content to the Calais Web service using their Calais API key.
2. Calais tags each person, place, fact and event in the content, making it machine-readable and interoperable on the Web.
3. Each piece of content, and each entity or event in that content, is assigned a unique identifier (a document ID and many URIs) that ties back to the Linked Data Cloud.
4. Publishers use the metadata Calais returns (tags, document IDs and URIs) to enhance their content and create features like topic pages that improve the reader experience.
5. Publishers can also use their metadata to share their content with next-generation search engines, news aggregators, etc.

Calais participates in this ecosystem as a platform. Calais lays the foundation on which, in conjunction with Content Management Systems, users can create a next-generation publishing site, service or community. Calais adopted the Linked Data standard to build a back-end infrastructure and repository, enabling linkage between concepts and documents. Linked Data is a standard promulgated by Sir Tim Berners-Lee.

Figure: some of the open data assets in the Linked Data cloud.
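
As a rough illustration of step 4 in “How it Works” above, the sketch below groups tagged documents by entity URI so that each entity gets its own topic page. It does not call the Calais Web service: the `Entity` and `TaggedDocument` classes, their field names, the `build_topic_pages` helper and the example.org URIs are all assumptions made for this example, standing in for whatever tags, document IDs and URIs the service actually returns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One tagged entity. Field names are illustrative, not the Calais schema."""
    uri: str    # unique identifier for the entity (placeholder URIs below)
    kind: str   # e.g. "Company" or "City"
    name: str


@dataclass(frozen=True)
class TaggedDocument:
    """Metadata assumed to come back for one submitted piece of content."""
    doc_id: str
    title: str
    entities: tuple


def build_topic_pages(docs):
    """Group document IDs by entity URI so each entity gets a topic page."""
    pages = defaultdict(lambda: {"name": None, "kind": None, "doc_ids": []})
    for doc in docs:
        for entity in doc.entities:
            page = pages[entity.uri]
            page["name"] = entity.name
            page["kind"] = entity.kind
            # Record each document once, even if the entity is tagged twice.
            if doc.doc_id not in page["doc_ids"]:
                page["doc_ids"].append(doc.doc_id)
    return dict(pages)


# Hypothetical sample metadata; the URIs are placeholders, not Calais URIs.
IBM = Entity("http://example.org/entity/ibm", "Company", "IBM")
PARIS_TX = Entity("http://example.org/entity/paris-texas", "City", "Paris, Texas")

DOCS = [
    TaggedDocument("doc-1", "IBM reports earnings", (IBM,)),
    TaggedDocument("doc-2", "IBM opens an office in Paris, Texas", (IBM, PARIS_TX)),
]

if __name__ == "__main__":
    for uri, page in build_topic_pages(DOCS).items():
        print(page["name"], "->", page["doc_ids"])
```

Keying the index on the URI rather than on the entity name is the point of the exercise: two stories that mention “Paris” only land on the same topic page if they were tagged with the same identifier.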
By embracing the Linked Data standard, and by creating a Calais repository of Linked Data assets on publicly traded companies, Thomson Reuters has built scaffolding that enables Web sites, social networks and other content-rich applications to navigate between previously separate silos of data and information.

Here’s how it works:
1. When Calais processes an article, it extracts many named entities. For some classes of named entities, such as companies, Calais now also returns an HTTP hyperlink, called a Uniform Resource Identifier (URI).
2. This hyperlink points into the Calais repository, to a machine-readable XML page containing related content (company description, management team, board of directors, etc.) as well as links to related assets in DBpedia, from Thomson Reuters, etc.
3. This linked data infrastructure forms a web of links that applications can navigate and use to pull information up for display or integration into the user experience.

Calais has thus created a lingua franca to drive content interoperability, and provided a simple standard for the sharing of rich semantic metadata.

“Calais provides a transportation layer that enables users to share their semantic metadata with downstream consumers like search engines, news aggregators, ‘related stories’ applications and more.”
-Thomas (Tom) Tague, Calais initiative lead

Here’s an example:
A news story breaks on an IBM earnings report. The user wants to find out if IBM has any affiliation with Warren Buffett of Berkshire Hathaway. Today such a complex query requires time-consuming research. Search engines can’t hopscotch through content.
But with Calais:
1. The news application sends the story to Calais.
2. Calais extracts IBM from the news story, ties it to International Business Machines Corporation in the Linked Data cloud and returns the URI (i.e. hyperlink) for IBM.
3. The app uses the IBM URI to retrieve the list of Board of Directors members from the content in the Calais repository.
4. The app queries the Board members for their other affiliations and finds a member who is also on the Board of Coca Cola plus a member who is the CEO of American Express.
5. The app runs a query on shareholders of Coca Cola and finds Berkshire Hathaway.
6. The app runs a query on shareholders of American Express and finds Berkshire Hathaway.

Diagram (links from IBM to Berkshire Hathaway):
- IBM Corporation, Board of Directors: Cathleen Black, William Brody, Kenneth Chenault, Michael Eskew
- Cathleen Black, Other Affiliations: President, Hearst Magazines; Board Member, Coca Cola
- Kenneth Chenault, Other Affiliations: CEO, American Express
- Coca Cola, Key Stockholders: Berkshire Hathaway
- American Express, Key Stockholders: Berkshire Hathaway
- Berkshire Hathaway, Management Team: Warren Buffett, Charlie Munger

Semantic extraction is far more powerful than keyword search, which can confuse Paris (Texas), Paris (France) and Paris (Hilton). Calais can determine that the Paris in a particular article is Paris, Texas, based on sophisticated disambiguation that leverages a variety of clues in the text.

New Applications:
Calais 4.0 and beyond will enable many emergent applications including:
- Publisher sites that dynamically mingle and deliver additional relevant content based on user preferences, profiles, history, friends’ selections and breaking topics that are hot now.
- Media monitoring tools that deliver slices of relevant information, e.g. content from all sites and blogs discussing natural disasters occurring near iron mines in Southeast Asia.
- Plug-ins that integrate social networking / community / blogging, and bypass search.
- Semantic ad networks and servers that go beyond keywords to inform ad placement with context, e.g. preventing airline ads from appearing next to news of air accidents.

Conclusion:
Armed with this powerful new tool, publishers are automating content operations, increasing productivity and cutting costs. They are enhancing the value of their content, improving their user experience and preparing to lead in the linked content economy. No one can predict precisely what kinds of creative and potentially game-changing applications will emerge. With more than nine thousand users in the community, Thomson Reuters expects to see hyper-evolution in many arenas.
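
The IBM example earlier in this paper, where an application hops from a company to its board members, their other affiliations and those companies' shareholders, can be sketched as a small graph search. This is only an illustration under stated assumptions: the `LINKS` dictionary is a hand-built, in-memory stand-in for linked data (the real repository is reached by following URIs), and `find_paths` is a hypothetical helper, not part of any Calais tooling. The facts in the graph are the ones given in the example.

```python
from collections import deque

# Hypothetical in-memory stand-in for linked data. Each key links to the
# names reachable from it; the relationship is noted in the comment.
LINKS = {
    # company -> board members
    "IBM": ["Cathleen Black", "William Brody", "Kenneth Chenault", "Michael Eskew"],
    # person -> other affiliations
    "Cathleen Black": ["Hearst Magazines", "Coca Cola"],
    "Kenneth Chenault": ["American Express"],
    # company -> key stockholders
    "Coca Cola": ["Berkshire Hathaway"],
    "American Express": ["Berkshire Hathaway"],
    # company -> management team
    "Berkshire Hathaway": ["Warren Buffett", "Charlie Munger"],
}


def find_paths(links, start, goal, max_hops=5):
    """Breadth-first search for every loop-free path of at most max_hops links."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop budget used up; do not extend this path
        for target in links.get(node, []):
            if target not in path:  # never revisit a node on the same path
                queue.append(path + [target])
    return paths


if __name__ == "__main__":
    for path in find_paths(LINKS, "IBM", "Warren Buffett"):
        print(" -> ".join(path))
```

Run against this sample graph, the search finds two routes from IBM to Warren Buffett, one through Coca Cola and one through American Express, matching steps 4 to 6 of the example. A keyword search has no such links to follow, which is the contrast the example draws.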