Calais PAWS Sep 4, 2008
Calais?
ClearForest
  • Founded in 1998 by text analytics pioneers
  • A software organization that enables Intelligent Information Enterprise and government customers
  • Led the market in the establishment of unstructured text as a key corporate asset
  • Acquired by Reuters, June 2007
  • Offices: Boston, Israel
The Text Problem
  • People consume text
  • Most of it isn’t semantically enabled
  • Most of it won’t be semantically enabled
  • Why: latency, cost and short shelf-life
Calais’ Piece of the Puzzle
  • A semantic metadata generation service that extracts entities, facts and events from unstructured text
  • Two new capabilities: topics & relevance
  • Available for commercial or non-commercial use up to 40,000 times per day
  • Input to Calais: Unstructured Documents (Text / HTML / XML)
  • Named Entities: People, Companies, Geographies, Albums, Authors, etc.
  • Facts: Position, Alliance, Education, Political Affiliation, etc.
  • Events: Management Change, IPO, Labor Action, Sporting, Entertainment, etc.
Reuters Announced the Acquisition of ClearForest
New York - April 30, 2007
Reuters, the global information company, has entered into an agreement to acquire all of the outstanding shares of ClearForest Ltd., a privately held provider of Text Analytics solutions, whose tagging platform and analytical products allow clients to derive precise business information from huge amounts of textual content.
ClearForest has received sufficient shareholder approval to complete the transaction, which is expected to close in approximately 30 days, subject to customary closing conditions. The financial terms were not disclosed.
Reuters plans to retain and continue to work with the existing management team and their highly skilled workforces in the US and Israel. It also plans to continue to support existing products and customers.
Reuters believes that search will be a pivotal element to the future of how financial information is sourced and consumed. As part of its drive into this space, Reuters has created a new strategic group and appointed Gerry Campbell, who will oversee the integration of ClearForest and drive this innovation.
Extracted metadata:
  <Topic>M&A</Topic>
  <Acquisition offset="494" length="130">
    <Company_Acquirer>Reuters</Company_Acquirer>
    <Company_Acquired>ClearForest Ltd.</Company_Acquired>
    <Status>Planned</Status>
  </Acquisition>
  <Company>Reuters</Company>
  <Company>ClearForest Ltd.</Company>
  <Product>Text Analytic Solution</Product>
  <Company>ClearForest Ltd.</Company>
  <Company>Reuters</Company>
  <Country>United States</Country>
  <Country>Israel</Country>
  <Company>Reuters</Company>
  <Person>Gerry Campbell</Person>
  <ManagementChange offset="2789" length="92">
    <Person>Gerry Campbell</Person>
    <Company>Reuters</Company>
    <Action>Enters</Action>
  </ManagementChange>
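A minimal sketch of how a client might read markup like the slide's example, using only Python's standard library. The markup here is the slide's simplified illustration, not necessarily the service's actual response format. Two things are assumptions added so the fragment parses as a single XML document: the hypothetical <Results> wrapper element, and the escaping of "M&A" as "M&amp;A". Only a subset of the slide's tags is included.

```python
import xml.etree.ElementTree as ET

# Subset of the simplified markup shown on the slide.
# <Results> is a hypothetical root element added for this sketch, and
# "M&A" is escaped as "M&amp;A" so the text is well-formed XML.
SAMPLE = """
<Results>
  <Topic>M&amp;A</Topic>
  <Acquisition offset="494" length="130">
    <Company_Acquirer>Reuters</Company_Acquirer>
    <Company_Acquired>ClearForest Ltd.</Company_Acquired>
    <Status>Planned</Status>
  </Acquisition>
  <Company>Reuters</Company>
  <Company>ClearForest Ltd.</Company>
  <Company>Reuters</Company>
  <Country>United States</Country>
  <Country>Israel</Country>
  <Person>Gerry Campbell</Person>
</Results>
"""


def summarize(xml_text):
    """Return de-duplicated entities and a list of acquisition events."""
    root = ET.fromstring(xml_text)

    entities = {}
    for tag in ("Company", "Country", "Person"):
        # findall() with a bare tag matches direct children of the root only,
        # so the Company_* fields nested in <Acquisition> are not counted.
        names = [el.text.strip() for el in root.findall(tag)]
        entities[tag] = sorted(set(names))

    events = []
    for acq in root.findall("Acquisition"):
        events.append({
            "acquirer": acq.findtext("Company_Acquirer"),
            "acquired": acq.findtext("Company_Acquired"),
            "status": acq.findtext("Status"),
            # offset/length locate the supporting passage in the source text
            "offset": int(acq.get("offset")),
            "length": int(acq.get("length")),
        })
    return entities, events


entities, events = summarize(SAMPLE)
print(entities["Company"])
print(events[0])
```

The repeated <Company> tags on the slide reflect one tag per mention, so a consumer that wants a list of distinct companies has to de-duplicate, as the sketch does with a set.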
What’s Behind an Event … An Example
Source text: Digital Marketing Services, Inc. (DMS), the leading provider of online marketing research and a division of America Online Inc. (AOL), today announced an alliance with Netcentives Inc. (Nasdaq: NCNT)
Extracted instances:
  • Company = Digital Marketing Services, Inc.
  • Company = Netcentives Inc.
  • Status = announced
  • DateString = today
  • Date = 2000-01-31
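The extracted instances above can be pictured as a small record in which the literal expression ("today") is kept alongside a normalized calendar date. The following is an illustrative sketch only: the AllianceEvent class, the RELATIVE_DAYS table and normalize_date are hypothetical names, and the document date of 2000-01-31 is inferred from the slide's Date value rather than stated in it.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AllianceEvent:
    """Hypothetical container for the instances listed on the slide."""
    companies: tuple
    status: str
    date_string: str  # the expression as written in the source text
    date: date        # the normalized calendar date


# Hypothetical lookup of relative date expressions, in days from the
# document's own date.
RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def normalize_date(date_string, document_date):
    """Resolve a relative expression against the document date."""
    key = date_string.strip().lower()
    if key in RELATIVE_DAYS:
        return document_date + timedelta(days=RELATIVE_DAYS[key])
    # Fall back to ISO dates such as "2000-01-31".
    return date.fromisoformat(date_string)


# Assumption: the press release was dated 2000-01-31.
document_date = date(2000, 1, 31)

event = AllianceEvent(
    companies=("Digital Marketing Services, Inc.", "Netcentives Inc."),
    status="announced",
    date_string="today",
    date=normalize_date("today", document_date),
)
print(event.date.isoformat())
```

Keeping both DateString and Date means a consumer can audit the normalization against the original wording.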
Live Example
  • Viewer Demo
  • Gnosis Demo
Extending Calais’ Reach
More than just a web service – a growing collection of tools and applications to make it valuable in the real world
Calais Browser Extensions Gnosis Content Management Tools WordPress Drupal UIMA Development Tools & Libraries PHP Ruby JAVA .NET Applications And more… TopBraid RSS Tagger Powerhouse LinkedFacts Wirecatch FeedShaver
How Calais is Being Used Today
  • Gist: automatically aggregates multiple news sources and slots them into topics, etc.
The Stack ClearForest Tags Platform File Based Connector Programmatic API (SOAP web Service) RDBMS  Connector Web Crawlers (Agents) Console Rich XML Live Feed Tooling Modeler Developer Cat Manager A F External Content/live feed/Enterprise Content ClearForest Extraction Modules B ClearForest Categorizer C
Detailed Stack Rich XML Rich XML ClearForest Tags Platform Files Document Conversion and Normalization Control DB Tags   API Control API File Based API Programmatic API (SOAP web Service) Web Agents RDBMS based API Enterprise System Categorizer Semantic Tagging Language ID Headline Generation Classifier Extraction Modules Language Classifier Templates Categorization Manager ClearForest Dvlpr/Modeler Languages Configuration Key Concepts Configuration ClearForest Studio Rich XML External Feed Configuration & Monitoring Console Farm Manager
Platform Highlights
  • Single run-time platform for all technologies
  • Modular architecture
  • Additional functional plug-ins can be added anywhere
  • Web services interfaces
  • SOA ready
  • Java based
  • Programmatic API to all components
  • Farming support for scalability
  • Best practices/standards (XML, Unicode, architectural patterns, design patterns …)
File API Programmatic API (SOAP web Service) RDBMS  based API Web Custom Document Tagging (Doc Runner) Categorization Information extraction Control Console Control API Tags Pipeline KB Writer DB Writer XML Writer IO Bound Rich XML ANS Collection DB Other (Headline Generation) Document Conversion Conversion  & Normalization PDF Conv. XML Conv. Doc Conv. File/Web/DB based API (Document Provider) Profile Listener Listener Listener Language identification Queues: CPU Bound Web Document Injector (flight plan) Technology
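The diagram labels above mention listeners feeding queues, with a CPU-bound tagging side and an IO-bound writer side. A toy sketch of that two-stage arrangement follows, using Python's standard queue and threading modules. Everything here is an assumption for illustration: the stage functions, the STOP sentinel and the whitespace-counting tag() stand-in are hypothetical and say nothing about how the ClearForest platform is actually implemented (the slides describe it as Java based).

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def tag(doc):
    # Stand-in for the CPU-bound tagging step (hypothetical logic):
    # here it only counts whitespace-separated tokens.
    return {"text": doc, "tokens": len(doc.split())}


def tagging_stage(inbox, outbox):
    """Take documents off the first queue, tag them, pass results on."""
    while True:
        doc = inbox.get()
        if doc is STOP:
            outbox.put(STOP)  # propagate shutdown to the writer stage
            break
        outbox.put(tag(doc))


def writer_stage(inbox, sink):
    """Take tagged results off the second queue and 'write' them."""
    while True:
        item = inbox.get()
        if item is STOP:
            break
        sink.append(item)  # stand-in for an XML or DB writer


def run(documents):
    cpu_queue, io_queue, sink = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=tagging_stage, args=(cpu_queue, io_queue)),
        threading.Thread(target=writer_stage, args=(io_queue, sink)),
    ]
    for worker in workers:
        worker.start()
    for doc in documents:
        cpu_queue.put(doc)
    cpu_queue.put(STOP)
    for worker in workers:
        worker.join()
    return sink


results = run(["Reuters acquires ClearForest", "Calais extracts entities"])
print(results)
```

Separating the stages with queues lets each side be scaled independently, for example more tagging workers when extraction is the bottleneck. With one worker per stage, as here, output order matches input order.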
The NLP Stack (top layer first)
  • Events & Facts
  • Entities: Candidates, Resolution, Normalization
  • Basic NLP: Noun Groups, Verb Groups, Numbers, Phrases, Abbreviations
  • Metadata Analysis: Title, Date, Body, Paragraph
  • Sentence Marking
  • Morphological Analyzer: POS Tagging (per word), Stem, Tense, Aspect, Singular/Plural, Gender, Prefix/Suffix Separation
  • Tokenization
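To make the layering concrete, here is a deliberately naive sketch of three of the layers (tokenization, sentence marking and entity candidates), each consuming the output of the one below it. The rules are toy assumptions chosen for brevity: a regex tokenizer, sentence breaks at ".", "!" and "?", and candidates defined as runs of capitalized tokens. None of this reflects the actual ClearForest extraction modules.

```python
import re


def tokenize(text):
    """Bottom layer: split text into word and punctuation tokens."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def mark_sentences(tokens):
    """Group tokens into sentences, ending each at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without closing punctuation
        sentences.append(current)
    return sentences


def entity_candidates(sentence):
    """Propose maximal runs of capitalized tokens as entity candidates."""
    candidates, run = [], []
    for tok in sentence:
        if tok[0].isupper():
            run.append(tok)
        else:
            if run:
                candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    return candidates


text = "Reuters has appointed Gerry Campbell. He will oversee ClearForest."
results = [entity_candidates(s) for s in mark_sentences(tokenize(text))]
print(results)
```

Note that "He" comes out as a candidate in the second sentence. That false positive is why the slide lists Resolution and Normalization after Candidates: proposing candidates is cheap, and later layers are expected to filter and merge them.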
Calais, Semantics and the Semantic Web: Issues, Opportunities
  • Ontologies
  • How do we make this a community effort?
  • Dereferenceable URIs & Endpoints
  • Engineering
  • Population
  • Basic data
  • Links
  • Proprietary data sources
  • Functions? Code?
What’s in the Pipeline?
2008
  • The basics of dereferenceable URIs
  • Disambiguation – company & geography
  • Hooks
2009 (this is a fuzzy list)
  • Person disambiguation (social networks?)
  • Other disambiguation
  • Continued population of endpoints
  • Calais as hub
  • Exposure of the IDE
  • User managed lexicons
  • Lots and lots of hooks
www.opencalais.com
  • Gallery – code and application examples
  • Forums
  • Documentation

Calais @ the Palo Alto Semantic Web Meetup


Editor's Notes

  • #2 First draft, with beautiful work by Sagit. Note that ALL text is editable.