Text Analytics Past, Present & Future


Text Analytics Past, Present & Future: keynote presentation by Seth Grimes at the TEMIS User Conference, Barcelona, July 9, 2009.

Published in: Technology, Education

  1. 1. Text Analytics Past, Present & Future<br />Seth Grimes<br />
  2. 2. &gt;&gt;Past, Present & Future<br />He who controls the present, controls the past. He who controls the past, controls the future.<br />-- derived from George Orwell’s 1984<br />
  3. 3. &gt;&gt; The Present: Today’s Market<br />I have estimated a $350 million global market in 2008, up 40% from $250 million in 2007.<br />Covers software licenses, vendor provided support and professional services.<br />$(hundreds) million more value created by:<br />Universities and research centers, especially in the life sciences.<br />Government, particularly for intelligence & counter-terrorism.<br />OEM licensees, for listening platforms, e-discovery, etc.<br />Systems integrators and consultants.<br />
  4. 4. &gt;&gt; Applications Today<br />Broadly grouped --<br />Intelligence and counter-terrorism.<br />Life sciences.<br />Content management, publishing & search.<br />Customer & market intelligence.<br />E-discovery.<br />Enterprise feedback.<br />Law enforcement.<br />Risk, fraud, compliance, and investigation.<br />
  5. 5. &gt;&gt;On the Demand Side…<br />How do current and prospective users see the market?<br />I recently published a study report, “Text Analytics 2009: User Perspectives on Solutions and Providers.” Drawing from the findings…<br />
  6. 6. &gt;&gt; Primary Applications<br />What are your primary applications where text comes into play?<br />
  7. 7. &gt;&gt; Primary Applications<br />Results found by Fern Halper of Hurwitz & Associates.<br />
  8. 8. &gt;&gt; The “Unstructured Data” Challenge<br />Sources are highly varied –<br /><ul><li>Web sites, news & journal articles, images, video.</li><li>Blogs, forum postings, and social media.</li><li>E-mail, contact-center notes and transcripts; recorded conversation.</li><li>Surveys, feedback forms, warranty & insurance claims.</li><li>Office documents, regulatory filings, reports, scientific papers.</li><li>And every other sort of document imaginable.</li></ul>
  13. 13. &gt;&gt; Important Sources<br />What textual information are you analyzing or do you plan to analyze?<br />Current users responded:<br />
  14. 14. &gt;&gt; Finding Business Value<br />Why? In customer-experience initiatives, for example, “more unsolicited, unstructured data [implies] increasing use of text analytics.”<br />-- Bruce Temkin, Forrester Research<br />
  15. 15. &gt;&gt; Information in Text<br />Do you need (or expect to need) to extract or analyze:<br />
  16. 16. &gt;&gt; Text Analytics Satisfaction<br />Please rate your overall experience -- your satisfaction.<br />Fern Halper of Hurwitz & Associates found in her 2009 survey, “all of the companies that had deployed text analytics stated that the implementations either met or exceeded their expectations. And, close to 60% stated that text analytics had actually exceeded expectations.”<br />
  17. 17. &gt;&gt; Today’s Text Analytics Players<br />Data mining and analytics.<br />Enterprise- and specialized-application focus.<br />Search tools and services.<br />Software-tool, OEM suppliers.*<br />Text analytics pure-plays, diverse applications.*<br />Web services.<br />* TEMIS categories.<br />
  18. 18. &gt;&gt; Today’s Text Analytics<br />Contrast with the 1999 landscape –<br />“The nascent field of text data mining (TDM) has the peculiar distinction of having a name and a fair amount of hype but as yet almost no practitioners.”<br />-- Prof. Marti A. Hearst,<br />“Untangling Text Data Mining,” 1999<br />(For our purposes, “text analytics” = “text mining” = “text data mining.”)<br />
  19. 19. &gt;&gt; What’s Past is Prologue<br />“Don’t look back. Something might be gaining on you.”<br />-- Satchel Paige<br />
  20. 20. &gt;&gt; Understanding the Challenge<br />Marti Hearst in 1999:<br />“Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically.”<br /> “[A] way to view text data mining is as a process of exploratory data analysis that leads to the discovery of heretofore unknown information, or to answers for questions for which the answer is not currently known.”<br />Challenges: Access, decoding, discovery, application.<br />
  21. 21. &gt;&gt; In Business Terms<br />Business intelligence (BI) as defined in 1958:<br /> “In this paper, business is a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera... The notion of intelligence is also defined here... as ‘the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.’”<br />-- Hans Peter Luhn, <br />“A Business Intelligence System,”<br />IBM Journal, October 1958<br />
  22. 22. Document input and processing<br />Information extraction<br />Knowledge management<br />H.P. Luhn, “A Business Intelligence System,” IBM Journal, October 1958<br />
  23. 23. &gt;&gt; Statistical Analysis of Content<br />“Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance.”<br />-- Hans Peter Luhn, “The Automatic Creation of Literature Abstracts,” <br />IBM Journal, April 1958<br />
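A minimal Python sketch of the idea in the Luhn quotation above: word frequency is used to pick out "significant" words, and each sentence is scored by how densely those words occur in it. This is a simplification for illustration, not Luhn's exact procedure; the stop-word list, the `min_count` threshold, the function names, and the sample text are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on", "for", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def significant_words(text, min_count=2):
    """Return non-stop words that occur at least min_count times in the text."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return {w for w, c in counts.items() if c >= min_count}


def score_sentences(text, min_count=2):
    """Score each sentence by the share of its tokens that are significant words."""
    keywords = significant_words(text, min_count)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        hits = sum(1 for t in tokens if t in keywords)
        scored.append((hits / len(tokens) if tokens else 0.0, sentence))
    return scored


# Invented sample text, not taken from the presentation.
sample = (
    "Text analytics extracts information from text. "
    "The weather was pleasant in Barcelona. "
    "Analytics turns text into structured information."
)
ranked = sorted(score_sentences(sample), reverse=True)
print(ranked[0][1])  # the highest-scoring sentence, a candidate for an abstract
```

The score relies only on counts and ignores grammar and meaning.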
  24. 24. &gt;&gt; Significance from Semantics<br />“This rather unsophisticated argument on ‘significance’ avoids such linguistic implications as grammar and syntax... No attention is paid to the logical and semantic relationships the author has established.”<br />-- Hans Peter Luhn, 1958<br />
  25. 25. &gt;&gt; Methods<br />Technologists developed approaches to taming text:<br />Vector-space representations.<br />Salton, Wong & Yang, 1975,<br />“A Vector Space Model for Automatic Indexing.” <br />Clustering & classification algorithms.<br />Naive Bayes.<br />Support Vector Machine.<br />K-nearest neighbor.<br />Linguistic methods.<br />Machine learning.<br />
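To make two of the methods listed above concrete, here is a small Python sketch that represents documents as term-frequency vectors, compares them by cosine similarity, and classifies a query with a one-nearest-neighbor rule. It is a sketch under stated assumptions: raw term counts stand in for the weighted terms used in practice, whitespace splitting stands in for real tokenization, and the function names, labels, and training texts are invented for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a sparse term-frequency vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_label(query, labeled_docs):
    """1-nearest-neighbor: return the label of the most similar labeled document."""
    q = term_vector(query)
    best = max(labeled_docs, key=lambda pair: cosine_similarity(q, term_vector(pair[0])))
    return best[1]


# Hypothetical training examples, invented for illustration.
training = [
    ("gene protein expression in cancer cells", "life sciences"),
    ("customer complaint about late delivery", "customer experience"),
    ("court ruling on document discovery request", "e-discovery"),
]
print(nearest_label("protein levels in tumor cells", training))
```

With larger collections, the same vectors could feed the other listed classifiers (Naive Bayes, support vector machines) or a clustering algorithm; the representation is what they share.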
  26. 26. &gt;&gt; Looking Ahead<br />
  27. 27. &gt;&gt; Market Trends<br />“The Diverse and Exploding Digital Universe,” (IDC, 2008)<br />Stronger than ever:<br />Life sciences.<br />Intelligence & counter-terrorism.<br />Continued steep growth:<br />Media & publishing.<br /><ul><li>Seek to mine and to classify/process.</li><li>For users, semantic annotations ease navigation and boost findability.</li></ul>Customer experience.<br /><ul><li>Key to quality, satisfaction.</li></ul>Market intelligence including competitive intelligence.<br /><ul><li>Aggregates and details are both important.</li></ul>
  28. 28. &gt;&gt; Technology Initiatives<br />Now and near future.<br />Semantic search.<br />Guha (IBM), McCool (Stanford), Miller (W3C): “The addition of explicit semantics can improve [navigational and research] search” (2003).<br />Question answering.<br />Matthew Glotzbach, Google: “Question answering is the future of enterprise search” (2006).<br />Sentiment analysis.<br />Bing Liu, Univ of Illinois: “The Web has dramatically changed the way that people express their views and opinions.”<br />
  29. 29. &gt;&gt;Technology Initiatives 2<br />Now and near future.<br />Listening platforms.<br />Bruce Temkin, Forrester Research: “The future is clearly about analyzing feedback in any form that your customers give it. That’s a trend that won’t go away.” <br />Text visualization.<br />We’re still coming to terms with the idea of actually extracting and exploiting the information content of rich media.<br />Web 3.0 & the Semantic Web.<br />Ronen Feldman, Bar-Ilan University and Hebrew University: “Text analytics [is] driving the Semantic Web” (2006).<br />
  30. 30. &gt;&gt; Search, from Keywords to Intelligence<br />Text analytics enables smarter search that better responds to user goals.<br />
  31. 31. &gt;&gt; Question Answering<br />Text analytics (information extraction) feeds curated knowledge bases.<br />
  32. 32. &gt;&gt; Sentiment Analysis<br />Two assertions:<br /><ul><li>Human communications are inherently subjective.</li><li>Opinion often masquerades as Fact.</li></ul>
  33. 33. &gt;&gt; Sentiment Analysis<br />“Sentiment analysis is the task of identifying positive and negative opinions, emotions, and evaluations.”<br />-- Wilson, Wiebe & Hoffmann, 2005, “Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis”<br />“Great hotel, just a few brilliant streets, full of restaurants and shops, from La Rambla. Beautiful hotel restaurant and the pool is UNBELIEVABLE! Single room is very modern and the blackout blind is awesome on mornings that you wish to sleep for a few more minutes. Will definitely be back!”<br />« Logiciel d’apparence assez simple (j’aime beaucoup l’icône de l’application), mais qui se trouve être très malin et sait se différencier de ses concurrents, par la possibilité de lui appliquer des thèmes ! »<br />[English: “Software that looks fairly simple (I really like the application’s icon), but which turns out to be very clever and manages to set itself apart from its competitors through the ability to apply themes to it!”]<br />
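As a rough illustration of the task defined in the quotation above, the following Python sketch labels a text as positive, negative, or neutral by counting words from small opinion lexicons and flipping the sign of a word that directly follows a negator. It is a deliberately naive baseline, not the phrase-level contextual-polarity method of the cited paper; the lexicons, the negator list, and the function names are assumptions invented for this sketch, and it handles English only.

```python
import re

# Hypothetical, tiny opinion lexicons invented for this sketch.
POSITIVE = {"great", "beautiful", "awesome", "brilliant", "modern"}
NEGATIVE = {"bad", "dirty", "noisy", "rude", "terrible"}
NEGATORS = {"not", "never", "no"}


def polarity(text):
    """Signed score: +1 per positive word, -1 per negative word.

    A word immediately preceded by a negator has its sign flipped.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = (token in POSITIVE) - (token in NEGATIVE)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score


def label(text):
    """Map the signed score to a coarse polarity label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("Beautiful hotel restaurant and the pool is awesome!"))
print(label("The room was not great and the staff were rude."))
```

A word such as "UNBELIEVABLE" in the hotel review shows the limits of a fixed lexicon: its polarity depends on context, which is the problem contextual-polarity methods address.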
  34. 34. &gt;&gt;Text Visualization<br />http://www.wordle.net/<br />
  35. 35. &gt;&gt;Web 3.0 & the Semantic Web<br />“We have many of the tools in place -- from Web 2.0 technologies… to unstructured data search software and the Semantic Web -- to tame the digital universe. Done right, we can turn information growth into economic growth.”<br />-- “The Diverse and Exploding Digital Universe,” (IDC, 2008)<br />“The Semantic Web is a web of data, in some ways like a global database.” -- Tim Berners-Lee, 1998<br />Web 3.0 = Web 2.0 + the Semantic Web + semantic tools.<br />
  36. 36. &gt;&gt;Web 3.0 & the Semantic Web<br />Recurring themes:<br />Semantically enriched -- context sensitive -- localized.<br />Technical concepts:<br />Linked Data -- Microformats, RDF, SPARQL – OWL.<br />Text analytics enables Web 3.0 and the Semantic Web.<br />Automated content categorization and classification.<br />Text augmentation: metadata generation, content tagging.<br />Information extraction to databases.<br />Exploratory analysis and visualization.<br />
  37. 37. Text Analytics Past, Present & Future<br />Seth Grimes<br />grimes@altaplana.com<br />http://altaplana.com<br />