Text Analytics 2009: User Perspectives on Solutions and Providers


A study and report on the state of the text analytics market with material describing text-analytics technology and solutions.


Published in Technology, Business


  • 1. Text Analytics 2009: User Perspectives on Solutions and Providers Seth Grimes An Alta Plana research study Sponsored by
  • 2. Text Analytics 2009: User Perspectives
    Table of Contents
    Executive Summary ... 3
    Text Analytics Basics ... 4
        Discovering Meaning in Text ... 4
    Software and Solution Market Overview ... 7
        Applications and Sources ... 7
    Demand-Side Perspectives ... 9
        Study Context ... 9
        About the Survey ... 10
    Demand-Side Study 2009: Response ... 13
        Q1: Length of Experience ... 13
        Q2: Application Areas ... 13
        Q3: Information Sources ... 14
        Q4: Return on Investment ... 15
        Q5: Mindshare ... 15
        Q6: Spending ... 16
        Q8: Satisfaction ... 16
        Q9: Overall Experience ... 16
        Q12: Like and Dislike ... 18
        Q13: Information Types ... 19
        Q14: Important Properties & Capabilities ... 20
    Additional Analysis ... 21
        Selected Cross-tabulations ... 21
        Interpretive Limitations ... 22
    About the Study ... 24
    Solution Profile: Attensity ... 26
    Solution Profile: Clarabridge ... 28
    Solution Profile: GATE ... 30
    Solution Profile: IxReveal ... 32
    Solution Profile: Nstein ... 34
    Solution Profile: SAP BusinessObjects ... 36
    Solution Profile: TEMIS ... 38
    Published May 31, 2009 under the Creative Commons Attribution 3.0 License.
  • 3. Text Analytics 2009: User Perspectives
    Executive Summary
    The global text-analytics market is growing at a very rapid pace, an estimated 40% in 2008, creating a $350 million market for software and vendor-supplied support and services. The total business value generated by text-analytics-reliant information products, in-house development, service providers, applications such as e-discovery, and research surely multiplies this figure eight-fold. The author projects 2009 market growth of up to 25% despite the economic downturn.

    Market Factors
    A number of factors have impelled sustained text-analytics market growth. The technology – text mining and related visualization and analytical software – continues to deliver unmatched capabilities both in early-adopter domains such as intelligence and the life sciences and in business sectors that have embraced text analytics more recently, in the last 3-5 years. These latter sectors include, notably, media and publishing, financial services and insurance, travel and hospitality, and consumer products and retail. Business and technical functions such as customer support and satisfaction, brand and reputation management, claims processing, human resources, media monitoring, risk management and fraud, and search have fueled recent growth.
    No single organization or approach dominates the market. While existing players have been very successful, they and new entrants continue to innovate, offering cutting-edge capabilities, for instance in sentiment analysis, as well as newer, as-a-service and mash-up-ready delivery models and capabilities targeted to market niches.

    Text Analytics 2009: User Perspectives
    Insights into the question, “What do current and prospective text-analytics users really think of the technology, solutions, and solution providers?” will help providers craft products and services that better serve users. The same insights will guide users seeking to maximize benefit for their own organizations.
    Alta Plana conducted a spring 2009 survey to explore the topic. This report, “Text Analytics 2009: User Perspectives on Solutions and Providers,” presents findings drawn from 116 responses, the majority from respondents who already use text analytics. The study was supported by seven sponsors but is editorially independent, designed and conducted by industry analyst and consultant Seth Grimes, a recognized expert in the application of text analytics.

    Key Study Stats
    The following are key study findings:
    Top business applications of text analytics for respondents are a) brand / product / reputation management (40% of respondents), b) competitive intelligence (37%), c) Voice of the Customer / Customer Experience Management (33%), and d) other research (33%).
    These applications match a focus on on-line sources, namely a) blogs and other social media (47%), b) news articles (44%), and c) on-line forums (35%), as well as direct customer feedback in the form of d) e-mail and correspondence (36%) and customer/market surveys (34%).
    Users with 2 years or more experience prefer tools that support specialized dictionaries, taxonomies, or extraction rules, and they often like open source.
    Prospective users expect to focus their initial text analytics work on inside-the-firewall feedback sources: e-mail, surveys, and contact center materials.
    Prospective users have high ROI hopes. Use of each of six different measures, led by increased sales to existing customers, is favored by over 50% of respondents who are not current users. Other measures are not far behind.
  • 4. Text Analytics 2009: User Perspectives
    Text Analytics Basics
    The term text analytics describes software and transformational steps that discover business value in “unstructured” text. The aim is to improve automated text processing. Most everything people do with electronic documents falls into one of four classes:
    1. Compose, publish, manage, and archive.
    2. Index and search.
    3. Categorize and classify according to metadata & contents.
    4. Summarize and extract information.
    Text analytics enhances the first and second sets of functions and enables the third and fourth. The remainder of this section will look at the technology, and the section after will look at the market and applications.

    Discovering Meaning in Text
    Text analytics encompasses applications of the technology in government, science, and industry and for cross-cutting tasks that range from information retrieval to text-fueled investigative analyses. Text analytics can be seen as a subspecies of business intelligence, and its capabilities will be an essential component of the eventual creation of the Semantic Web.

    Structure in Text
    Text – news and blog articles, scientific papers, spoken call-center conversations, survey responses, product reviews posted to on-line forums, this report – is replete with structure. Humans (relatively easily) learn to use this structure – the morphology of individual words, the syntax that governs the composition of expressions, the grammar behind phrases and sentences, and the larger-scale structure of text as organized and presented in Web pages, e-mail, newspapers, books, and myriad other forms – to both understand and generate text. We are able to do this without conscious thought, coupled with a grasp of context, knowledge, and emotion that allows us to understand often-complex interactions.
    Text-analytics software technology – text mining and related visualization and analytical tools – enables machine treatment of text that replicates, automates, and extends human capabilities.

    Sense-Making through Statistics
    The earliest approaches to automated text analysis applied statistical methods to text. Consider Hans Peter Luhn’s 1958 IBM Journal paper, “The Automatic Creation of Literature Abstracts,”[1] which envisaged application of statistics for sense-making and summarization. Luhn wrote, “Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the auto-abstract.”
    Luhn illustrated his approach, as shown in the figure below, with the kind of frequency analysis that is performed today by search-engine optimization (SEO) tools and software such as Wordle that generates word and tag clouds. Luhn additionally proposed a Keyword-in-Context (KWIC) indexing system that is at the root of modern information retrieval methods.
    [1] The paper is behind a “paywall.”
  • 5. Text Analytics 2009: User Perspectives
    Figure caption: “Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance”: H.P. Luhn

    Vector Space Methods
    Vector-space models became the prevailing approach to representing documents for information retrieval, classification, and other tasks. The text content of a document is reduced to an unordered “bag of words” that becomes a point in a high-dimensional vector space that may embed the word content of many documents, as illustrated in the diagram that appears to the right.[2] Approaches such as TF-IDF (term frequency–inverse document frequency) weigh the significance of a term in a document against its prevalence in a larger document set.
    We apply additional analytical methods to make text tractable, for instance, latent semantic indexing utilizing singular value decomposition for term reduction / feature selection to create a new, reduced concept space. In plain English, such techniques identify and retain the most important concepts and consolidate or eliminate lesser concepts.
    Text analytics will typically apply one or more of a number of statistical clustering and classification methods to documents. These methods include Naive Bayes, Support Vector Machines, and k-nearest neighbors. The diagram to the left illustrates the identification of a hyperplane, the red line given a 2-D picture, that best separates the dot-/circle-represented documents into distinct sets.
    [2] Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975
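    The TF-IDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular product: the function name tf_idf, the sample documents, and the specific weighting (raw term count times log(N / document frequency)) are assumptions chosen for clarity, and many other variants are in use.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (one of many variants): raw term count in the
    document times log(number of documents / documents containing the term).
    """
    # Reduce each document to an unordered bag of lowercase words.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights


# Hypothetical three-document collection.
docs = [
    "the market grew",
    "the market fell",
    "the survey ran in spring",
]
weights = tf_idf(docs)
```

    In this toy collection, a word that appears in every document ("the") receives a weight of zero, while a word unique to one document ("grew") receives the highest weight in that document, which is the behavior the prose describes.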
  • 6. Text Analytics 2009: User Perspectives
    Linguistic Approaches
    Statistical approaches have a hard time making sense of nuanced human language, an issue that H.P. Luhn foresaw in 1958. Luhn wrote in his visionary paper, cited above, “This rather unsophisticated argument on ‘significance’ [inferred from a word’s frequency of use] avoids such linguistic implications as grammar and syntax. In general, the method does not even propose to differentiate between word forms. Thus the variants differ, differentiate, different, differently, difference and differential could ordinarily be considered identical notions and regarded as the same word. No attention is paid to the logical and semantic relationships the author has established. In other words, an inventory is taken and a word list compiled in descending order of frequency.”
    Consider the following pair of sentences, proposed by Luca Scagliarini of Expert System. The two cases produce the same “bag of words,” but their meanings, the data content of the texts, are very different given the switch of fell and gained.
    The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78.
    The Dow gained 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite fell 6.84, or 0.32 percent, to 2,162.78.
    Linguistic approaches will, for instance, analyze the parts of speech of a phrase, detecting the subject-verb-object triple that constitutes a factual (or subjective) statement as well as additional, modifying elements.

    Natural Language Processing
    Part-of-speech (POS) analysis is typically one of a sequence or pipeline of resolving steps applied to text. Other, typically applied steps include:
    Tokenization: Identification of distinct elements within a text, usually words, expressions, punctuation marks, white space, etc.
    Stemming: Identifying variants of word bases created by conjugation, declension, case, and pluralization, e.g., “act” for “acts,” “actor,” and “acted.”
    Lemmatization: Use of stemming and other techniques, including analysis of context and parts of speech, to associate multiple words or terms with a canonical term. For example, “better” might have “good” as its lemma.
    Entity Recognition: Look-up in lexicons or gazetteers and use of pattern matching to discern items such as names of people, companies, products, and places and expressions such as e-mail addresses, phone numbers, and dates.
    Tagging: XML mark-up of distinct elements, a.k.a. text annotation.
    Entities are one type of “feature” found in text. Other features of interest include:
    Attributes: A person’s attributes include age, sex, height, and occupation.
    Abstract attributes: Properties such as “expensive” and “comfortable.”
    Concepts: Abstractions of entities, for instance, a category.
    Metadata: In this context, items that describe a document such as the author, creation date, and title as well as a topic tag.
    Facts and relationships: These include statements such as “Dow fell 46.58.”
    Subjective data: Covers sentiment, opinions, mood, and other attitudinal data.
    The next section of the report looks at how the technology is applied.
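    The bag-of-words limitation illustrated by the two stock-market sentences can be checked directly. The following is a minimal sketch: the tokenize function and its regular expression are assumptions made for this illustration (a crude stand-in for the tokenization step listed above), not the behavior of any specific text-analytics tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption for this sketch): lowercase words and
    numbers, keeping internal commas, periods, and apostrophes together."""
    return re.findall(r"[a-z0-9]+(?:[.,'][a-z0-9]+)*", text.lower())


sentence_a = (
    "The Dow fell 46.58, or 0.42 percent, to 11,002.14. "
    "The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, "
    "and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78."
)
sentence_b = (
    "The Dow gained 46.58, or 0.42 percent, to 11,002.14. "
    "The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, "
    "and the Nasdaq composite fell 6.84, or 0.32 percent, to 2,162.78."
)

tokens_a = tokenize(sentence_a)
tokens_b = tokenize(sentence_b)

# Unordered view: word counts only, order discarded.
same_bag = Counter(tokens_a) == Counter(tokens_b)

# Ordered view: adjacent word pairs retain some of the sentence structure.
bigrams_a = set(zip(tokens_a, tokens_a[1:]))
bigrams_b = set(zip(tokens_b, tokens_b[1:]))

print("Same bag of words:", same_bag)
print("('dow', 'fell') in A:", ("dow", "fell") in bigrams_a)
print("('dow', 'fell') in B:", ("dow", "fell") in bigrams_b)
```

    The word counts are identical for the two texts, yet the adjacent-pair view already separates “Dow fell” from “Dow gained.” Pairs of adjacent words are only a rough proxy for the subject-verb-object analysis that linguistic approaches perform, but they show why word order carries the facts.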
  • 7. Text Analytics 2009: User Perspectives
    Software and Solution Market Overview
    What we now see as text analytics was actually, in the late 1950s, put forward as the foundation for a visionary business intelligence system. This system would focus on discovering and communicating relationships (and not just data values) and on business-goal alignment. Knowledge-management questions drove this early BI conceptualization, with answers to questions such as: What is known? Who knows what? Who needs to know? to be derived or discovered via text mining.[3]
    Such systems are technically very difficult to realize, and BI of course developed in other directions. Numerical data, drawn from transactional and operational systems and stored in databases, is far easier to analyze than is information locked in text. BI and related tools and techniques – spreadsheets, reporting, OLAP, data mining – generally do an excellent job of creating business value from this data.
    In the last few years attention has turned back to text sources. Commercial software vendors – and open source projects – have responded to the opportunity.

    Applications and Sources
    Applications of text mining in the life sciences and intelligence date to the 1990s, for purposes that include pharmaceutical lead generation – mining scientific literature to accelerate expensive, time-consuming drug-discovery processes – and counter-terrorism. A number of factors – the huge and growing volume of on-line content, advances in search and information retrieval, cheap computing power, and better software – have created a market for application of these same text technologies to a much broader variety of business, scientific, and research problems.

    Application domains
    Market awareness has grown immensely in the last 3-5 years, but up-take and experiences have varied by application domain.
    To study adoption, survey question 2 asked, “What are your primary applications where text comes into play?” It listed the following choices, an attempt to capture the most important application domains:
    Brand/product/reputation management
    Competitive intelligence
    Content management or publishing
    Customer service
    E-discovery
    Financial services
    Compliance
    Insurance, risk management, or fraud
    Law enforcement
    Life sciences or clinical medicine
    Product/service design, quality assurance, or warranty claims
    Research (not listed)
    Voice of the Customer / Customer Experience Management
    [3] “BI at 50 Turns Back to the Future”
  • 8. Text Analytics 2009: User Perspectives
    Information sources
    In each of the application areas listed above, text analytics enhances existing analyses. It enhances both BI and data mining applied to transactional data and non-automated review of textual sources, a.k.a. reading. By automating the reading process, text analytics allows analysts and researchers to tap material that had not previously been systematically mined. It allows them to work far faster and to analyze far greater volumes of information than ever before. Importantly, text analytics can make a huge difference in text analysis and processing costs and enable the creation of new information products and services.
    Survey question 3 asked about information sources. These sources may be grouped:
    On-line and social media: blogs and other social media (Twitter, social-network sites, etc.); news articles; review sites or forums.
    Enterprise communications and feedback: chat and/or instant messaging text; contact-center notes or transcripts; customer/market surveys; e-mail and correspondence; employee surveys; point-of-service notes or transcripts; SMS/text messages; warranty claims/documentation; Web-site feedback.
    Operational materials (of course varying by business): crime, legal, or judicial reports or evidentiary materials; insurance claims or underwriting notes; medical records; patent/IP filings; scientific or technical literature.

    Application modes
    The applications themselves vary widely. They may be classified in several (overlapping) groups:
    Media and publishing systems – the author includes search engines here – use text analytics to generate metadata and enrich and index metadata and content in order to support content distribution and retrieval. Semantic Web applications would fit in this category.
    Content management systems – and, again, related search tools – use text analytics to enhance the findability of content for business processes that include compliance, e-discovery, and claims processing.
    Line-of-business systems for functions such as compliance and risk, customer experience management (CEM), customer support and service, human resources and recruiting.
    Investigative and research systems for functions such as fraud, intelligence and law enforcement, competitive intelligence, and life sciences research.
    This list is representative and not exhaustive. All listed areas are experiencing strong growth.
    In certain cases, the contribution of text analytics is not at all obvious. Google and other major search engines top their responses to “map massachusetts” and “34+178” and “orcl” with a map, the number 212, and Oracle share data, respectively, enabled by their ability to recognize named entities and expressions. This particular application of text analytics is shallow but reaches a very, very large audience.

    Solution providers
    Text-analytics solution providers include a significant cadre of young but mature pure-play software vendors, software giants that have built or acquired text technologies, robust open-source projects, and a constant stream of start-ups, many of which focus on market niches or specialized capabilities such as sentiment analysis. The provider side is vibrant and doing well despite the adverse economic climate due to the market’s growing awareness of solution providers’ ability to respond to business needs and technical challenges alike.[4]
  • 9. Text Analytics 2009: User Perspectives
    Demand-Side Perspectives
    Alta Plana designed a spring 2009 survey, “Text Analytics demand-side perspectives: users, prospects, and the market,” to collect raw material for an exploration of key text-analytics market-shaping questions:
    What do customers, prospects, and users think of the technology, solutions, and vendors?
    What works, and what needs work?
    How can solution providers better serve the market?
    Will your companies expand their use of text analytics in the coming year?
    Will spending on text analytics grow, decrease, or remain the same?
    It is clear that current and prospective text-analytics users wish to learn how others are using the technology, and solution providers of course need demand-side data to improve their products, services, and market positioning, to boost sales and better satisfy customers. The Alta Plana study therefore has two goals:
    1. To raise market awareness and educate current and prospective users.
    2. To collect information of value to sponsors.
    Survey findings, as presented and analyzed in this study report, provide a measure of the state of the market, a form of benchmark. They are designed to be of use to everyone who is interested in the commercial text-analytics market.

    Study Context
    Text-analytics solutions have been applied to a spectrum of business problems. Provider revenues are booming (for most established providers). Academic and industrial research is only expanding, and there has been a steady pace of emergence of new companies in the field, many of them academic spin-offs. Demand-side views are, anecdotally, quite positive, judging from published user stories and case studies and based on inquiries from organizations that are researching solutions.
    The author previously explored market questions in a number of papers and articles. These included white papers created for the Text Analytics Summit in 2005, “The Developing Text Mining Market,”[5] and 2007, “What's Next for Text.”[6]

    Analyst and Provider Analyses
    The 2007 paper contains a number of telling quotations:
    “Organizations embracing text analytics all report having an epiphany moment when they suddenly knew more than before.” – Philip Russom, the Data Warehousing Institute
    “Growth is largely driven by the wealth of unstructured information found on the external web, in corporate intranets, document repositories, call-centers, and in customer and employee business communications.” – IBM researcher David Ferrucci
    Other analysts and solution providers have had a lot to say about the benefits and growth of text analytics. The article “Perspectives on Text Analytics in 2009”[7] is a systematic (albeit informal) survey of industry perspectives that reports provider CEO, CTO, and thought-leader responses to the query: “What do you see as the 3 (or fewer) most important text analytics technology, solution or market challenges in 2009?”
  • 10. Text Analytics 2009: User Perspectives
    Responses were informative, based on the respondents’ own research and, especially, on extensive contact with customers and prospects.
    In the current context, a market challenge articulated by Aaron B. Brown, IBM program director for ECM Discovery, is particularly telling. That challenge is for text-analytics solution providers to better define business cases. According to Brown, “In the current economic situation, organizations are clamping down on new projects and more than ever looking for hard ROI savings to justify investment. To pass the funding bar, text-analytics solutions, which typically fall in the category of new projects undertaken for business optimization, need to come with solid business cases that demonstrate hard-dollar operational savings based on proven examples. Given the emerging nature of many text-analytics solution areas, this will be a challenge to growth in 2009.”
    Business cases don’t rest solely on solution-provider research and assertions, of course. Demand-side experiences and perceptions can and should also contribute.

    Demand-Side Views
    A systematic look at the demand side will provide a good complement to provider-side views and to vendor- and analyst-published case studies.
    Alta Plana’s 2008 study report, “Voice of the Customer: Text Analytics for the Responsive Enterprise,”[8] published by, was our first systematic survey of demand-side perspectives, albeit focused on a particular set of business problems. VoC analysis is frequently applied to enhance customer support and satisfaction initiatives, in support of marketing, product and service quality, brand and reputation management, and other enterprise feedback initiatives.
    The listening concept is extended to other voice applications: Voice of the Patient, Voice of the Market, etc.
    Views related in our 2008 study were telling:
    “Text Analytics is exciting technology, opening up new applications and approaches to solving information needs and supporting decision making for an improved customer experience.” – Michael House, Maritz Research, Division Vice President
    “We've uncovered concepts and relationships in text that would be too costly – or even impossible – to detect by any other methods. We can now combine multiple data sources to evaluate customer expectations and improve customer satisfaction by employing more one-to-one customer contact and preemptively resolving customer complaints to keep our retention rates high.” – Federico Cesconi, Cablecom, head of customer insight and retention

    About the Survey
    There were 116 responses to the 2009 survey, which ran from April 13 to May 10.

    Survey invitations
    The author solicited responses via:
    E-mail to the TextAnalytics, Corpora, datamining2, sla-dkm (Special Libraries Association, Division for Knowledge Management), sla-dite (SLA Information Technology), Asis-l (American Society for Information Science), and GATE lists and the author’s personal list.
  • 11. Text Analytics 2009: User Perspectives
    Invitations published in electronic newsletters: Intelligent Enterprise, KDnuggets,, TDWI’s BI This Week, Text Analytics Summit, and
    Notices posted to LinkedIn forums and Facebook groups and on Twitter.
    Messages sent by sponsors to their communities.

    Survey introduction
    The survey started with a definition and brief description as follows:
    Text Analytics is the use of computer software to automate:
    annotation and information extraction from text – entities, concepts, topics, facts, and attitudes,
    analysis of annotated/extracted information,
    document processing – retrieval, categorization, and classification, and
    derivation of business insight from textual sources.
    This is a survey of demand-side perceptions of text technologies, solutions, and providers. Please respond only if you are a user, prospect, integrator, or consultant. There are 20 questions. The survey should take you 5-10 minutes to complete. For this survey, text mining, text data mining, content analytics, and text analytics are all synonymous. I'll be preparing a free report with my findings. Thanks for participating!

    Survey response
    There is little question that the survey results overweight current text-analytics users – 63% of respondents who answered Q1, “How long have you been using Text Analytics?,” versus 61% of respondents who replied to Q7, “Are you currently using text analytics?” – among the broad set of potential business, government, and academic users.

    BI market comparison
    We can infer this overweighting, for example, from market-size figures. The author estimates a $350 million global market for text-analytics software and vendor-supplied support and services.
    By contrast, in March 2009, research firm IDC published a preliminary 2008 BI-market estimate. IDC’s sizing “suggests that the business intelligence tools software market grew 6.4% in 2008 to reach $7.5 billion.”[9] Former Forrester analyst Merv Adrian estimated $8.4 billion for 2008. A simple, good-enough heuristic says that if the BI market is 20 times the size of the text-analytics market, there are likely around 20 times as many BI users as there are text-analytics users.

    Data mining comparison
    Another contrasting data point is that 55% of respondents to a March 2009 KDnuggets poll[10] report currently using text analytics on projects. KDnuggets reaches data miners, a technically sophisticated audience who are among the most likely of any market segment to have embraced text analytics. The rate of text-analytics adoption by data miners surely exceeds the rate of adoption by any other user sector.
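    The arithmetic behind the BI market comparison above can be made explicit. The sketch below uses only the market-size figures quoted in the text; the variable names are illustrative, and the step from market-size ratio to user-count ratio remains the report's own "good-enough" heuristic rather than something the calculation proves.

```python
# Market-size figures quoted in the text, in US dollars.
text_analytics_market = 350_000_000  # author's estimate
bi_market_estimates = {
    "IDC (preliminary, 2008)": 7_500_000_000,
    "Merv Adrian (2008)": 8_400_000_000,
}

# Ratio of each BI-market estimate to the text-analytics estimate.
ratios = {
    source: size / text_analytics_market
    for source, size in bi_market_estimates.items()
}

for source, ratio in ratios.items():
    print(f"{source}: BI market is about {ratio:.1f} times the text-analytics market")
```

    Both estimates put the BI market at somewhat more than 20 times the text-analytics market, which is consistent with the round factor of 20 used in the heuristic.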
  • 12. Text Analytics 2009: User Perspectives
    How much did you use text analytics / text mining in 2008?
    Response                            Count   Share
    Did not use                            45     45%
    Used on < 10% of my projects           17     17%
    Used on 10-25% of projects             14     14%
    Used on 26-50% of my projects          11     11%
    Used on over 50% of my projects        14     14%
    As an aside, 52% of KDnuggets respondents stated that they would use text analytics more in 2009 than in 2008, and 42% stated that their use would be about the same as in 2008, which strongly suggests growth in the user base.
  • 13. Text Analytics 2009: User Perspectives
Demand-Side Study 2009: Response
The subsections that follow tabulate and chart survey responses, which are presented without unnecessary elaboration.
Q1: Length of Experience
How long have you been using Text Analytics? (response percentage, with cumulative response for current users)
    not using, no definite plans to use: 16%
    currently evaluating: 22%
    less than 6 months: 8% (cumulative 8%)
    6 months to less than one year: 5% (cumulative 13%)
    one year to less than two years: 7% (cumulative 20%)
    two years to less than four years: 18% (cumulative 37%)
    four years or more: 25% (cumulative 63%)
Q2: Application Areas
What are your primary applications where text comes into play?
    Brand / product / reputation management: 40%
    Competitive intelligence: 37%
    Voice of the Customer / Customer Experience …: 33%
    Research (not listed): 33%
    Customer service: 22%
    Content management or publishing: 19%
    Life sciences or clinical medicine: 18%
    Insurance, risk management, or fraud: 17%
    Financial services: 15%
    E-discovery: 15%
    Product/service design, quality assurance, or …: 14%
    Other: 13%
    Compliance: 8%
    Law enforcement: 7%
  • 14. Text Analytics 2009: User Perspectives
Q3: Information Sources
What textual information are you analyzing or do you plan to analyze?
    blogs and other social media: 47%
    news articles: 44%
    e-mail and correspondence: 36%
    on-line forums: 35%
    customer/market surveys: 34%
    scientific or technical literature: 27%
    contact-center notes or transcripts: 25%
    Web-site feedback: 21%
    review sites or forums: 21%
    medical records: 16%
    employee surveys: 16%
    insurance claims or underwriting notes: 15%
    chat and/or instant messaging text: 15%
    other: 14%
    crime, legal, or judicial reports or evidentiary materials: 13%
    point-of-service notes or transcripts: 12%
    patent/IP filings: 11%
    SMS/text messages: 8%
    warranty claims/documentation: 7%
  • 15. Text Analytics 2009: User Perspectives
Q4: Return on Investment
Question 4 asked, “How do you measure ROI, Return on Investment? Have you achieved positive ROI yet?” Results are charted from highest to lowest values of the sum of “currently measure” and “plan to measure”:
How do you measure ROI, Return on Investment? (chart series: Currently Measure, Plan to Measure, Measure or Plan to Measure, Achieved; labeled values follow)
    increased sales to existing customers: 54%
    higher satisfaction ratings: 51%
    improved new-customer acquisition: 46%
    higher customer retention/lower churn: 39%
    reduction in required staff/higher staff productivity: 38%
    more accurate processing of claims/requests/casework: 36%
    faster processing of claims/requests/casework: 36%
    ability to create new information products: 34%
    fewer issues reported and/or service complaints: 30%
    lower average cost of sales, new & existing customers: 30%
    higher search ranking, Web traffic, or ad response: 28%
    other: 18%
Q5: Mindshare
A word cloud, generated at, seemed a good way to present responses to the query, “Please enter the names of text-analytics companies you have heard of.”
  • 16. Text Analytics 2009: User Perspectives
Q6: Spending
Question 6 asked, “How much did your organization spend in 2008, and how much do you expect to spend in 2009, on text-analytics solutions?”
[Pie charts: 2008 Spending and 2009 Spending. Categories: use open source; under $50,000; $50,000 to $99,000; $100,000 to $199,999; $200,000 to $499,999; $500,000 or above.]
Q8: Satisfaction
Question 8 asked, “Please rate your overall experience – your satisfaction – with text analytics.” Results are as shown:
[Pie chart. Categories: Completely satisfied; Satisfied; Neutral; Disappointed; Very disappointed.]
Q9: Overall Experience
Question 9 asked, “Please describe your overall experience – your satisfaction – with text analytics.” The following are 32 verbatim responses, lightly edited for spelling and grammar and to mask the two products that were named: We are highly satisfied. Costs were lower than expected due to high degree of automation. Expectations were exceeded. More timely and more fine grained customer insight and market intelligence and competitive intelligence than ever before. It's been a fun journey, but still struggling with how to get to root cause and how far text
  • 17. Text Analytics 2009: User Perspectives analytics can get you there vs. need analysts. No one solution addresses every use case. Some solutions better address the up-front creation of dictionaries than others. I would like a more automated system the integrates with our current IS. Not really neutral but it's sort of a love hate thing. There's a very high learning curve, sometimes it's seductive to measure things that aren't relevant - to run things just because you cannot because they tell you anything. But the customers like it - even if they don't understand it. I want to see more applications Pretty good on named entity extraction, fairly good on fact extraction, poor on sentiment analysis. Several possibilities, several applications; Emphasis on efficiency enhancing; solutions; Problems in selling accuracy. I was satisfied of the effectiveness of the tools - specifically for named-entity recognition. Good but still have a ways to go with capabilities OK, it is hard to describe satisfaction of using text analytics tools when we all know how language is ambiguous and complex - we cannot expect too much from automatic processing yet, maybe in the time when neutral networks can be used, but NLP on its own cannot impress us yet I think. Developing part-of-speech tagging for Arabic text, morphological analyzer, to deal with wide range of text domain, formats and genres. Frustration with developing custom dictionaries that allow real-time categorization of content. Pleased with progress in neural analysis of text content. I'm building this all myself using open source tools. I'm extremely satisfied. Hard learning curve, but we have it going now. Excellent. We have pretty low expectations for the accuracy of automated classification techniques, and those are fulfilled but not exceeded. We use automated categorization in building demos, but most of our customers use semi-automated or manual tagging. It has been extremely valuable in certain situations. 
We always look at the text and verbatims with our [product] software It's great, but most of it is primarily designed for the English language only. As soon as you need other languages, you need a lot of different providers (= increased implementation costs) or you have to pay a lot of money. I have written an entire textbook based upon text analytics and plan to write another. 92% accuracy, 6.7 fold increase in productivity, cut search time by 50% Hundreds of hours of auditors’ time has been saved by a combination of scanning of hard copy evidence, electronic evidence collection, and importing into [product], building business rules from auditors defined keywords to produce first cut analysis classification. Very satisfied - state-of-the-art in text analytics is advancing at a very rapid pace and text-analytics based solutions are able to demonstrate business value addition/ROI. Feedback from our users with the current tools is that they are not meeting their needs, which is why we are looking at other solutions. Difficult implementation into our core software, but now works as designed. We have presented sentiment analysis on a wide range of documents and used the information to be predictive in nature. Text analytics allows us to gain new customer and market insights as well as better competitive intelligence: higher report frequency, automated reporting, lower cost, finer granularity. Great hopes. Long way to go. Too early to tell. 10 million Voice of Customer can be in real time understood.
  • 18. Text Analytics 2009: User Perspectives
Q12: Like and Dislike
Question 12 asked, “What do you like or dislike about your solution or software provider(s)?” Respondents were allowed to enter up to five points. Twenty-seven individuals responded, entering a total of 82 points. One respondent entered “cost” in all five slots. The following table normalizes, classifies as positive or negative, and groups the responses into thematic categories. We take the sum of positive and negative remarks in a category as indicating the category’s importance, so the chart is sorted in descending order of the total number of remarks.
[Bar chart: What do you like or dislike about your solution or software provider(s)? Series: Plus, Minus, Sum. Vertical axis: number of remarks, 0 to 14.]
  • 19. Text Analytics 2009: User Perspectives
Q13: Information Types
Do you need (or expect to need) to extract or analyze -
    Other: 15%
    Other entities – phone numbers, e-mail & street addresses: 40%
    Metadata such as document author, publication date, title, headers, etc.: 53%
    Events, relationships, and/or facts: 55%
    Concepts, that is, abstract groups of entities: 58%
    Sentiment, opinions, attitudes, emotions: 60%
    Topics and themes: 65%
    Named entities – people, companies, geographic locations, brands, ticker symbols, etc.: 71%
Q19: Comments
There were twelve comments. Several pushing-the-envelope respondent observations were particularly interesting:
“We were shocked at the lack of appreciation for hosted and/or turnkey solutions from many vendors we evaluated in 2008. The product capabilities of many commercial solutions were poorly conceived, leading us to believe that they didn't really understand the potential of text analytics.”
“As a market research supplier, my clients cross a number of industries. Thus, lack of scalability is the major obstacle to adopting text analysis for my purpose.”
“Twitter data requires new text analytic algorithms, because of the presence of ‘@person’ fields, hashtags, and HTML links that have been shortened. As a consequence, "traditional" algorithms don't work. I am developing those algorithms myself, which is yet another reason I use open source tools exclusively.”
One other comment is interesting and prompts a response. “We are building an information retrieval product and wish to embed out-of-the-box functionality but with the option to plug in other 3rd party analytical components.” The response is that several frameworks provide a plug-in architecture for the construction of IR and other text-analytics applications. These include:
UIMA, the Unstructured Information Management Architecture, an Apache Incubator project that was recently approved as an OASIS standard.
GATE, the General Architecture for Text Engineering.
Eclipse SMILA, a new SeMantic Information Logistics Architecture project.
  • 20. Text Analytics 2009: User Perspectives
Q14: Important Properties & Capabilities
What is important in a solution?
    ability to use specialized dictionaries, taxonomies, or extraction rules: 62%
    broad information extraction capability: 59%
    deep sentiment/opinion extraction: 53%
    low cost: 51%
    support for multiple languages: 39%
    predictive-analytics integration: 37%
    BI (business intelligence) integration: 35%
    open source: 24%
    ability to create custom workflows: 24%
    sector adaptation (e.g., hospitality, insurance, retail, health care, communications, financial services): 23%
    media monitoring/analysis interface: 22%
    hosted or "as a service" option: 22%
    supports data fusion / unified analytics: 19%
    interface specialized for your line-of-business: 17%
    vendor's reseller/integrator/OEM relationships with tech or service providers: 13%
    other: 9%
  • 21. Text Analytics 2009: User Perspectives
Additional Analysis
The survey was designed so that responses to questions would be easy to interpret and immediately useful without elaborate cross-tabulation or filtering. The exception was cross-tabulation of length of time using text analytics and of whether a respondent is currently using text analytics or not with other variables.
Selected Cross-tabulations
The author’s interpretation of survey findings generally supports prior notions, points such as –
Length of involvement with text analytics correlates with particularity of requirements. Each bar represents the percentage of respondents in a time category who indicated that “ability to…” is important:
[Bar chart. Series: “Ability to use specialized dictionaries, taxonomies, or extraction rules is important” and “Ability to create custom workflows is important”. Time categories: less than 6 months; 6 months to less than one year; one year to less than two years; two years to less than four years; four years or more. Vertical axis: 0% to 100%.]
Length of involvement with text analytics correlates with preference for open source:
[Bar chart: Open source is important versus Time using Text Analytics, over the same five time categories. Vertical axis: 0% to 60%.]
Using / Not
Other interesting points come out of contrasting respondents who are already using text analytics with respondents who are still in planning stages.
Sources
The top responses to “What textual information are you analyzing or do you plan to analyze?” for current users are:
    blogs and other social media (twitter, social-network sites, etc.): 62%
  • 22. Text Analytics 2009: User Perspectives
    news articles: 55%
    on-line forums: 41%
    e-mail and correspondence: 38%
    customer/market surveys: 35%
These are on-line and other feedback-rich sources. Their high rate of selection suggests that veteran users have found significant benefit in these sources. By contrast, only three information-type categories were selected by over 26% of respondents who are not yet using text analytics:
    e-mail and correspondence: 37%
    customer/market surveys: 34%
    contact-center notes or transcripts: 29%
It’s easy to infer that the value of on-line materials (social media, news articles, forums), which is evident to current users, has not yet become clear to prospective users. That only a minority chose any particular category suggests some combination of the following, that
    Prospective users are more broadly distributed across application categories.
    Prospective users are cautious about how many different sources they tackle initially.
The particular top selections suggest that the plurality – the largest portion – of prospective users will focus initially on materials they have on hand that involve interactions with known stakeholders. Web sources can come later.
Expectations
Prospective users are not similarly guarded in their expectations. When responses to Question 4 “How do you measure ROI, Return on Investment?” are split out by current versus prospective use, six measures are each selected by between 50% and 55% of prospective-user respondents. They are:
    increased sales to existing customers
    improved new-customer acquisition
    higher satisfaction ratings
    fewer issues reported and/or service complaints
    faster processing of claims/requests/casework
    reduction in required staff/higher staff productivity
(Of prospective-user respondents, almost a quarter are already using “increased sales to existing customers” as an ROI measure, which makes sense.
Sales are easily tracked and analyzed by current systems where items such as satisfaction ratings are not.) “Higher customer retention/lower churn” comes in at just under 50% and three others top 38%. These prospective users, and the folks who advise them, would do well to manage and focus their expectations.
Interpretive Limitations
The number of survey respondents was not large enough to support further useful
  • 23. Text Analytics 2009: User Perspectives cross-tabulation of variables beyond the analyses above. In interpreting presented findings, do keep in mind that the survey was not designed or conducted scientifically, that is, with the intention or the actuality of creating a random sample or a statistically robust characterization of the broad market. Findings surely reflect selection bias due to 1) the outlets where the survey was advertised and 2) a likelihood that those individuals who are unaware of text analytics or the potential for text analytics to help them solve their business problems would not respond to the survey. Findings therefore over-represent current text-analytics users and also over-represent, to a lesser extent, the business intelligence and data warehousing communities. Finally, responses to several of the survey questions were not especially illuminating or likely to be of much use to report readers. These questions are, in particular,
    Question 10. Who is your provider?
    Question 11. How did you identify and choose your provider?
    Question 15. What BI (business intelligence) software do you use if any?
    Question 16. What social media do/would you look to for text-analytics contacts, discussions, or information?
    Question 17. What industry publications do you receive, on paper or electronically?
    Question 18. What industry/technical conferences do you attend?
  • 24. Text Analytics 2009: User Perspectives
About the Study
Text Analytics 2009: User Perspectives on Solutions and Providers reports the findings of a study conducted by Seth Grimes, president and principal consultant at Alta Plana Corporation. Findings were drawn from responses to a spring 2009 survey of current and prospective text-analytics users, consultants, and integrators. The survey asked respondents to relay their perceptions of text-analytics technology, solutions, and vendors. It asked users to describe their organizations’ usage of text analytics and their experiences.
Sponsors
The author is grateful for the support of seven sponsors – Attensity, Clarabridge, the University of Sheffield (GATE project), IxReveal, Nstein, SAP, and TEMIS – whose financial contribution enabled him to conduct the current study and publish study findings. The content of the sponsor solution profiles was provided by the sponsors. The survey findings and the editorial content of this report do not necessarily represent the views of the study sponsors. This report, with the exception of the sponsor solution profiles, was not reviewed by the sponsors prior to publication.
Media Partners
The author acknowledges assistance received from six media partners in disseminating invitations to participate in the survey. Those media partners are Intelligent Enterprise, KDnuggets,,, the Text Analytics Summit, and The Data Warehousing Institute (TDWI).
Seth Grimes
Author Seth Grimes is an information technology analyst and analytics strategy consultant. He is contributing editor at Intelligent Enterprise magazine, founding chair of the Text Analytics Summit, an instructor for The Data Warehousing Institute (TDWI), KDnuggets contributor, and text analytics channel expert at the Business Intelligence Network. Seth founded Washington DC-based Alta Plana Corporation in 1997.
He consults, writes, and speaks on information-systems strategy, data management and analysis systems, industry trends, and emerging analytical technologies. Seth can be reached at, 301-270-0795.
  • 25. Text Analytics 2009: User Perspectives Sponsor Solution Profiles
  • 26. Text Analytics 2009: User Perspectives Solution Profile: Attensity Business is built on conversations. These customer, partner, and employee conversations are captured in emails, call notes, letters, surveys, forums and other social media, and more. Attensity enables you to use these conversations to drive better relationships with your customers – transforming them into loyal advocates of your business. Attensity delivers the power of sophisticated data and semantic analytics in an integrated suite of easy-to-use business applications, allowing business leaders, customer support personnel, and customers to get relevant and actionable answers fast. An Integrated Suite of Products to Help You Manage the Customer Conversation: Analyze and Respond Attensity's ability to extract valuable insight from free-form text anywhere and transform it into actionable insights offers organizations the opportunity to understand their customers and to manage the entire customer conversation – analyzing and responding to customer needs. Recognized as best-of-breed by leading analysts for more than a decade, our applications, powered by the industry’s leading natural language processing technologies, are designed to automate related business processes, and add the rigor and speed necessary to swiftly identify often subtle relationships and root causes and to respond timely and accurately to customers. Equally important, our easy-to-use business applications are not only designed for analysts, but also for business leaders, researchers, brand and category managers, and customer service representatives, while also used directly by customers to efficiently self- serve. 
Attensity Voice of the Customer/Market Voice allows your organization to glean and analyze your customers’ candid thoughts about your brand and products, rapidly and accurately understanding and analyzing comments in E-Service records, surveys, and emails, along with the market buzz found in web communities, blogs, product reviews and social media sites. This delivers the actionable insights – authentic customer sentiments and issues around your brand, products, services, your competitors and more – you need to make smarter decisions and deliver better products and services. Attensity Voice of the Customer/Market Voice features sophisticated integrated reporting and pre-packaged Voice of the Customer extraction domains for fast time-to-value, detailed sentiment analysis, and an extensive partner solutions network to help you extend the value of your applications. Attensity’s other products include E-Service Suite, Automated Response Management, Research and Discovery and Intelligence Analysis. E-Service Suite offers an Agent Service Portal and a Self-Service application that enables your customers to effectively self-serve while your agents are empowered to extend informed and efficient service support in real time. Attensity Automated Response Management, a part of the E-Service Suite, optimizes and automates up to 100% of the handling of all incoming and outbound customer communications, enabling you to deliver a superior customer experience while achieving significant operational efficiency and productivity gains in your contact center. Research and Discovery provides your organization with sophisticated information extraction, advanced classification and enterprise-class search of and access to internal and external data, helping you meet compliance and litigation demands while controlling costs.
Intelligence Analysis allows commercial and government organizations to “connect the dots” by delivering automatic extraction and analytical processing of “relational events” from unstructured data – not only who or what, but the “why, when, where and how.” A Relentless Focus on Customer Success Companies across the full industrial spectrum and around the globe are discovering how our advanced solutions help them thrive by helping resolve customer support issues more quickly, enable more accurate research and analysis of customer feedback, and rapidly address and proactively prevent problems while mitigating risk. Across industries, companies are optimizing customer interaction processes in the contact center, deepening customer relations through effective and efficient self-serve support, and growing their competitive edge with Attensity solutions adapted to their industry specific business needs. Attensity’s team of vertical experts allows us to provide expert advice and specialized applications for areas such as aerospace, automotive, consumer packaged goods, contact center outsourcing, financial services and insurance, government and law enforcement, healthcare, hospitality, manufacturing, media and publishing, retail, technology,
  • 27. Text Analytics 2009: User Perspectives and telecommunications. Attensity has a strong record of customer success across all of our products, including Voice of the Customer, E-Service, and Research and Discovery. Three of our VoC success stories are presented here. JetBlue Airways | New York-based JetBlue Airways has created a new airline category based on value, service and style. Known for its award-winning service and free TV as much as its low fares, JetBlue is now pleased to offer customers the most legroom throughout coach (based on average fleet-wide seat pitch for U.S. airlines). JetBlue is also America’s first and only airline to offer its own Customer Bill of Rights, with meaningful compensation for customers inconvenienced by service disruptions within JetBlue’s control. JetBlue Airways currently uses Attensity’s Voice of the Customer application in its customer service organization to uncover customer issues, requirements and overall sentiment about the airline. The company’s pilot project demonstrated a significant ability to find key information about customer sentiment and tangible data around how to augment its services. JetBlue uses Attensity VoC to proactively manage and analyze all freeform customer feedback to improve service, marketing, sales and the products they offer. “From our Customer Bill of Rights to our customer advisory council, JetBlue is dedicated to bringing humanity back to air travel,” Bryan Jeppsen, Research Analyst Manager said. “One of the best ways to do that is to listen — truly listen — to our customers. Our commitment with Attensity enables us to mine subtle but important clues from all forms of customer communications to continue improving all aspects of our company. We’re eager to learn as much as we can, and we’re excited to have Attensity’s simple to use yet sophisticated software at our service.” JetBlue customer service analysts use Attensity VoC daily to cull insights and actions from feedback. 
“Attensity Voice of the Customer offers us the unprecedented ability to automatically extract customer sentiments, preferences and requests we simply wouldn’t find any other way,” according to Jeppsen. “Attensity VOC enables us to intelligently structure, search and integrate the data into our other business intelligence and decision-making systems.” Charles Schwab | For this Global 1000 investment services firm, Attensity is a central part of efforts to understand and act on customer feedback. With hundreds of thousands of interactions per month, the need to understand customer issues, act on signs of dissatisfaction and churn and drive sales and service interactions can be the difference between success and failure. With Attensity they are able to capture these interactions through customer service notes, emails, survey responses and online discussions and analyze them to power customer retention and growth. Attensity Voice of the Customer enables Schwab to analyze customer feedback to drive proactive programs and understand emerging issues and opportunities, communicate key issues and opportunities at the client segment level on a daily basis, and integrate this valuable customer feedback into their SAS analytics platform on their Teradata data warehouse to expand the customer signature and to deepen customer loyalty analytics. Attensity has become integral to Schwab customer program planning and churn identification efforts. The firm has improved satisfaction and been able to mitigate churn via improved direct broker communications with customers and marketing programs. Customer satisfaction, specifically reasons customers are not happy, is directly monitored and specific issues are addressed. Issues can include problems with services, communication, collateral, and specific individual interactions. 
Attensity also helps the firm dig deep into Net Promoter™ Program results, uncovering reasons customers give low scores and identify as “detractors.” Attensity contributed to important changes to account statements. Most importantly, Attensity reduced the time needed to analyze customer satisfaction issues from almost 1 year to less than one week! Whirlpool | As a $13.2B appliance manufacturer and the #1 appliance manufacturer in the world, Whirlpool focuses on great products and great customer relationships to maintain and grow its global customer base. As a customer-centered company, Whirlpool needs to understand the root cause of pain points and brand, product, and service related issues. With the vast amounts of customer service records, emails, survey responses and online community forums, there is more than enough data to get and use customer insights to improve customer experiences. When Whirlpool started with Attensity in 2004, the company wanted to be able to leverage the web and over 8.5 million annual customer and repair visit interactions captured in service notes to drive marketing programs, product development, and quality initiatives. Whirlpool has done just that and more. With over 300 Attensity VoC users worldwide, Whirlpool listens and acts on customer data in the service department, its innovation and product development groups, and in the market every day. With Attensity VoC, Whirlpool gets early warning of safety and warranty issues and has been able to mitigate expensive recalls through rapid change out of defective parts. Whirlpool extrapolates an ~80% reduction in the cost of recalls due to early detection. In addition to Attensity-fueled product quality improvements, Whirlpool better understands customers’ needs and wants – and the competition and what they are doing to win over customers.
  • 28. Text Analytics 2009: User Perspectives Solution Profile: Clarabridge Clarabridge was founded with the simple premise of enabling companies to drive business value by understanding key customer and prospect experiences. Now more than ever, consumer-focused companies turn to Clarabridge to help retain customers, attract new customers, cut servicing and operational costs, sell more products to current customers, and develop more relevant products and services. Clarabridge is the leading provider of text mining software for Customer Experience Management (CEM) due to four key strengths: Commitment to CEM applications: Clarabridge’s rapid growth is due to a focus on the value our customers gain from leveraging our VOC solutions. Our staff, technology, customers, and partners are all 100% focused around delivering VOC applications, and our entire company is committed to providing business value for our customers. Speed-to-Value: No other advanced text mining solution can be deployed as efficiently and powerfully as Clarabridge. Whether an implementation is source specific or enterprise-wide, no other vendor can compete with the speed with which our customers not only implement but recognize value. Market Leadership: We believe that being a market leader is more than market statistics and sales wins. While Clarabridge dominates these statistics, we believe that being the market leader also means being a thought leader, an innovator and a standard-setting force in the marketplace. Clarabridge is the first company in our industry to organize a specific user group and conference on using text-mining to support VOC. The Best Technology: There are many great technologies in the text mining world. Some are proven in academia and government think tanks, others within very controlled implementations. But no current text mining technology can compete with our ability to deliver repeatable and tangible business value within the commercial space.
Enabled by text analytics, CEM provides the opportunity to create innovative offerings from the start while targeting the precise customer segments and later react to customer feedback on desired improvements and enhancements. Text Mining to Support Business Improvements Clarabridge’s content mining process involves three integrated components: 1. Collect and Connect: Clarabridge's pre-built source connectors allow easy access to external and internal customer information, harvesting content from all of your listening posts. Clarabridge’s built-in feedback module allows the design, deployment and capture of surveys, campaigns, community forums and web forms.
  • 29. Text Analytics 2009: User Perspectives 2. Mine and Refine: Once all textual content is sourced, Clarabridge extracts meaning through its fully integrated and automated features, so millions of verbatims transform seamlessly into actionable information. Clarabridge's deep-parsing Natural Language Processing technology extracts parts of speech and linguistic relationships. This output is used for downstream entity and fact extraction, sentiment extraction, categorization, and root cause analysis. 3. Analyze and Discover: Clarabridge provides two interfaces with a range of functional and analytic tools: Clarabridge Reporting and Analysis and Clarabridge Navigator. Analysts and business users can identify key themes and emerging issues, dynamically investigate results, set up alerts and drill into root causes with the full discovery functionality integrated into the software. Case Studies: Technology in Action Today, leading Fortune 1000 companies across all major markets rely on Clarabridge for the essential customer experience intelligence they require for strategic insight and pre-emptive action. Supported by the Clarabridge Content Mining Platform, clients are able to capture a 360-degree view of current customer attitudes and sentiment shifts, rather than settle for a limited understanding of their Voice of the Customer. Use cases reflect Clarabridge’s successful engagements and their outcomes with clients across a range of major industries. AOL uses Clarabridge to manage, capture and analyze over 5 million website feedback forms for over 150 products in dozens of languages. Clarabridge automatically processes the feedback and reports the quantified insights directly to product teams. A major international airline uses Clarabridge to capture and analyze over 7 million surveys per year, allowing it to analyze drivers of loyalty and dissatisfaction for all of its customer segments.
The airline can better meet the needs of its passengers through improved understanding of the drivers of customer satisfaction. Gaylord Entertainment used Clarabridge to replace its manual guest satisfaction review process with automatic coding, sentiment extraction and reporting. VOC analysis is available in near real time based on the needs of Gaylord employees. By driving more business through high-value event planners and raising customer satisfaction scores, Gaylord has had enormous business and customer experience success using Clarabridge. Vision, Experience, and Strength Clarabridge’s goal is to help you fully access your customer experience intelligence and leverage that information to your advantage. By bridging the gap between your customer’s experience and your brand’s promise, we provide a unique portal into the human dimension of your business. With this insight, you gain the strategic edge in serving your customers, controlling costs and risk, competing resourcefully, and building profitability. When you work with Clarabridge, you work with the management team that has guided the company’s growth and innovation from the start. Each member has decades of experience, bolstered by successful entrepreneurial ventures and strengthened by prior top-level management experience. Executives, who include a nationally recognized entrepreneur and a multiple patent holder, are all published authors and frequent speakers at industry conferences. With a commitment to excellence, a partnership model with clients, and fast-paced development processes, Clarabridge is strong from the ground up. What’s more, our financial backing, board advisors, reputation, and partnerships are sound, ensuring our software will evolve to meet your emerging demands.
  • 30. Text Analytics 2009: User Perspectives Solution Profile: GATE
    General Architecture for Text Engineering: An Open Source Solution for Full Lifecycle Text Analytics
    FREE
    Open source, licensed under LGPL allowing unrestricted commercial use, hosted on SourceForge.
    100% JAVA
    Runs on any platform supporting Java 5 or later. Developed and tested daily on Linux, Windows, and Mac OS X.
    EFFICIENT
    Optimisations included with the latest version provide a 20 to 40% speed and memory usage improvement. Highly efficient finite state text processing engine; many plug-ins with linear execution time.
    MATURE AND ACTIVELY SUPPORTED
    In development since 1996; now at version 5.0; around 20 active developers.
    POPULAR
    Assessed as “outstanding” and “internationally leading” by an anonymous EPSRC peer review. Used at thousands of sites: companies, universities and research laboratories, all over the world. ~35,000 downloads/year. Rolling funding for more than 15 staff at the University of Sheffield.
    COMPREHENSIVE
    Support for manual annotation, performance evaluation, information extraction, [semi-]automatic semantic annotation, and many other tasks. Over 50 plug-ins included with the standard distribution, containing over 70 resource types. Many others available from independent sources.
    DATA MANAGEMENT
    Pluggable input filters with out of the box support for XML, HTML, PDF, MS Word, email, plain text, etc. Common in-memory data model built around stand-off annotation, documents and corpora. Persistent storage layer with support for XML or Java serialisation. I/O interoperation with many other systems.
    STANDARD ALGORITHMS
    Ready made implementations for many typical NLP tasks such as tokenisation, POS tagging, sentence splitting, named entity recognition, co-reference resolution, machine learning, etc.
    USER INTERFACE
    Comprehensive tool set for data editing and visualisation, rapid application development, manual annotation, ontology management.
    INTEGRATION
    Leveraging the power of other projects such as:
    • Information Retrieval: Lucene (Nutch, Solr), Google and Yahoo search APIs, MG4J;
    • Machine Learning: Weka, MaxEnt, SVMLight, etc.;
    • Ontology Support: Sesame and OWLIM;
    • Parsing: RASP, Minipar, and SUPPLE;
    • Other: UIMA, Wordnet, Snowball, etc.
    COMMUNITY AND SUPPORT
    Friendly and active community of developers and users offers efficient help. Commercial support available from Ontotext and Matrixware.
    STANDARDS BASED
    Reference implementation in ISO TC37/SC4 LIRICS project; supports XCES, ACE, TREC etc. formats; founder member of OASIS/UIMA committee.
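    The "stand-off annotation" data model named under DATA MANAGEMENT can be illustrated with a minimal sketch. It is written in Python for brevity and is not GATE's Java API; the class names `Annotation` and `Document`, the `annotate` and `covered_text` helpers, and the sample sentence are all hypothetical. The idea it shows is that annotations point at character offsets in the text instead of being embedded in it as markup, so the text itself is never modified.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A stand-off annotation: offsets into the text plus a type and features."""
    start: int                      # inclusive character offset
    end: int                        # exclusive character offset
    ann_type: str                   # e.g. "Organization" (illustrative label)
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Holds the original text untouched; annotations live beside it."""
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, start, end, ann_type, **features):
        ann = Annotation(start, end, ann_type, features)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        # Recover the annotated span by slicing the unmodified text.
        return self.text[ann.start:ann.end]


original = "GATE is developed at the University of Sheffield."
doc = Document(original)

name = "University of Sheffield"
begin = doc.text.index(name)
org = doc.annotate(begin, begin + len(name), "Organization", kind="university")
tool = doc.annotate(0, 4, "Software")

for ann in doc.annotations:
    print(ann.ann_type, "->", doc.covered_text(ann), ann.features)
```

    Because the text is left untouched, several overlapping or competing annotation layers can coexist over the same document, which is one common reason for choosing a stand-off design.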
  • 31. Text Analytics 2009: User Perspectives OVERVIEW GATE was first released in 1996, then completely re-designed, re-written, and re-released in 2002. The system is now one of the most widely-used systems of its type and is a comprehensive infrastructure for language processing software development. The new UIMA architecture from IBM/Apache has taken inspiration from GATE and IBM have paid the University of Sheffield to develop an interoperability layer between the two systems. Key features of GATE are: • Component-based development reduces the systems integration overhead in collaborative research. • Automatic performance measurement of Language Engineering (LE) components promotes quantitative comparative evaluation. • Distinction between low-level tasks such as data storage, data visualisation, discovery, and loading of components and the high-level language processing tasks. • Clean separation between data structures and algorithms that process human language. • Consistent use of standard mechanisms for components to communicate data about language, and use of open standards such as Unicode and XML. • Insulation from idiosyncratic data formats (GATE performs automatic format conversion and enables uniform access to linguistic data). • Provision of a baseline set of LE components that can be extended and/or replaced by users as required. TEXT ANALYSIS Text Analysis (TA) is a process which takes unseen texts as input and produces fixed- format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in Information Retrieval (IR) applications. TA covers a family of applications including named entity recognition, relation extraction, and event detection. GATE has been used for TA applications in domains including bioinformatics, health and safety, and 17th century court reports. 
TA systems built on GATE have been evaluated among the top ones at international competitions (MUC, ACE, Pascal). A system built by the GATE team came top in two of three categories in the NTCIR 2007 patent classification competition.
    THE GATE FAMILY
    • GATE Developer: an integrated development environment for language processing components bundled with the most widely used Information Extraction system and a comprehensive set of other plug-ins
    • GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the services used by GATE Developer and more
    • GATE Teamware: a collaborative annotation environment for high volume factory-style semantic annotation projects built around a workflow engine and the GATE Cloud backend web services
    • GATE Cloud: a parallel distributed processing engine that combines GATE Embedded with a heavily optimized service infrastructure
    FIRST COUSINS: THE ONTOTEXT FAMILY
    • Ontotext KIM: UIs demonstrating our multiparadigm approach to information management, navigation and search
    • Ontotext Mimir: (Multi-paradigm Information Management Index and Repository) a massively scalable multiparadigm index built on Ontotext's semantic repository family, GATE's annotation structures database plus full-text indexing from MG4J
    Contact: Prof. Hamish Cunningham. Research funding: EU, UK Research Councils and JISC.
  • 32. Text Analytics 2009: User Perspectives Solution Profile: IxReveal IxReveal is a leading analytics software company that transcends current search and business intelligence technologies. Our patented platforms transform large volumes of unstructured and structured data into actionable intelligence, while enabling automatic and collaborative sharing of concepts, connections, and findings. Clients include global corporations, financial institutions, health organizations, and major government agencies with data-intensive needs in areas such as fraud, security, finance, crime, and intelligence.
    uReveal is aimed at helping analysts in organizations solve business problems and make informed business decisions by leveraging their investment in collecting data. Organizations have spent millions of dollars in collecting and storing information like crime incidents, claims, customer calls, emails etc. With uReveal, they are able to combine the structured and unstructured data to find meaningful trends and patterns to fight crime and insurance fraud and to reshape the organization to be customer focused.
    Law Enforcement: “uReveal made our analysts ridiculously efficient.” - Crime Analysis Manager
    uReveal provides the bottom or top-line changing ability to analyze huge volumes of textual data. It works with various data sources like existing search infrastructures, databases containing textual information, emails, and content management systems. uReveal’s powerful decision support capabilities are finally making it possible to find trends and patterns and zero in on critical slices of information buried deep within the text.
    Insurance: “Level of accuracy of suspicious claims identification increased five-fold and false positives decreased.” - Insurance Claims Manager
    uReveal is a tool that has been developed for analysts, putting them in control by enabling them to focus their precious time on value-added analysis - instead of having to read all the documents returned.
It is designed for small to mid-sized workgroups that work with vast amounts of free-form information as part of their jobs. With an intuitive and highly configurable user interface and patent pending analytics (such as relationship discovery and integrated charting/graphing capabilities), uReveal users are able to create a personalized environment to get their job done faster.
    Insurance: “This technology not only helps our analysts become very efficient but helps us save on legal costs as well.” - Workers Comp Fraud Manager
    uReveal is the solution for analytical teams that work with unstructured information and provide decisive insight as part of a mission critical business process. Using uReveal, they can both find and substantiate business insights and recommendations, pointing back to the unstructured information as validation.
  • 33. Text Analytics 2009: User Perspectives uReveal gives your analysts:
    Control over the analytic process.
    A user-friendly configuration that frees analysts from calling on expensive IT specialists.
    A built-in repeatable methodology to quickly accomplish goals.
    The ability to slice and dice large amounts of text data in different ways to find trends, patterns, and hidden nuggets of critical information.
    The capability to combine data from databases with the textual data, completely leveraging all the information collected by an organization (i.e., in one task the analyst may be interested in “finding a needle in the haystack” insight while in another task, the analyst may need to “track trends and themes” in the data).
    uReka! is aimed at helping consumers, enterprise searchers, students and researchers rapidly understand and find information on topics from the myriad search results and websites that they are inundated with. uReka! is the only configurable “search and analyze” product that dramatically increases speed, relevance and insight. Stop Reading. Start Finding.
    Find It: Easily switch between multiple search engines. Searches numerous internal and external data sources.
    Use It: Reintroduces serendipitous discovery. Automatically reads the results for you. Provides new suggestions and extends the thought process.
    Share It: Saves an unlimited number of "Concept Banks." Extracts concepts from search results. Integrates with existing Microsoft Windows security policies.
    Key Benefits of Using IxReveal Solutions To overcome the myriad problems presented by today’s information overload and to increase overall efficiency and profitability, IxReveal provides solutions for Voice of the Customer, Market Intelligence, Call Center Analytics, Fraud Detection Analytics, and Intelligence Analytics. These solutions deliver important benefits.
    For the Analyst: Provides a multifold productivity increase. Changes the analytical paradigm from "search to read" to "search to analyze."
Easy-to-use software accelerates user adoption, shortens "time-to-benefits" and reduces dependency on overburdened IT. Supports all analytic needs from finding a needle in a haystack to identifying trends and patterns. Easy to create graphs and charts.
    For the Organization: Allows organizations to leverage investments in all collected data, structured and unstructured. Reduces start-up costs by leveraging existing infrastructures. Reduces adoption costs: easy-to-use software enables quick learning and decreases time to achieve value. Positively impacts both top and bottom line for the company.
  • 34. Text Analytics 2009: User Perspectives Solution Profile: Nstein An old philosophical riddle asks: if a tree falls in the forest, and no one is there to hear it, does it make a sound? Thinkers have amused themselves for centuries crafting clever solutions to the question, with little consensus emerging. Consider this: If you publish your content and no one consumes it, does it have value? That one is easy. The answer is no. For content to have value it must be connected with people. The more targeted and qualified the audience, the more valuable the content. Text Mining helps you connect your content with people. It improves visibility, findability, and relevancy, thus bolstering SEO, stickiness and ultimately, CPM. Nstein Technologies Since Nstein’s founding a decade ago, the world has borne witness to a rapidly shifting publishing landscape. Information has become decentralized, content divorced from brand, publishing stalwarts left floundering. But through all this, Nstein’s mission has remained the same: to connect people to relevant and valuable content. Connecting people to content: that’s what Nstein does, and does well. Reed Business Information, Reader’s Digest, Condé Nast, impreMedia, Hearst, ProQuest, Financial Times and Scripps – prestigious newspapers, magazines, broadcasters and online content providers worldwide – can all attest to the efficacy of Nstein’s solutions. Powering Digital Publishing Nstein's content management solutions are backed by powerful linguistic technology that enables intelligent and lucrative content linking and repurposing. Publishers and other content-driven organizations increase digital revenues and decrease operational costs by centrally managing all content assets. Web Content Management (WCM) is a framework built to help publishers attract more unique visitors and increase time spent on their online properties, ultimately increasing online revenues.
Digital Asset Management (DAM) is expressly designed for complete management of content assets, enabling organizations to:
    o syndicate and repurpose content automatically over multiple channels
    o increase staff productivity
    o better control operating costs
    Text Mining Engine (TME) is a powerful multilingual solution that enables publishers to offer a more engaging experience to online readers by leveraging the "aboutness" of their content. The result is an automated process that generates contextual metatags for various content types and sources, from newsroom articles, newsfeeds, audio and video files to blogs, forums, and user-generated content. Nstein's TME establishes links among content assets and surrounds them with rich and relevant information. What sets Nstein solutions apart from others is that they are content-centric rather than document-centric. By integrating Web Content Management (WCM), Digital Asset Management (DAM) and Text Mining Engine (TME), Nstein has given publishers a way of aligning their business models with powerful digital content management that can be easily integrated into their existing content supply chain.
  • 35. Text Analytics 2009: User Perspectives A Different Way of Thinking We begin with a core premise that beneath all data lies more data: metadata. Inside every word, sentence, paragraph, article, journal, book or archive there is metadata: additional information about the information being presented. There are interrelationships between different metadata. Harnessing the power of this information-about-information is the first step in connecting people to content, which is inherent to capitalizing on that content. Text Mining Technology: An Engine of Change Nstein’s Text Mining Engine (TME), now in its fifth and most powerful iteration, enriches content with intelligent metadata. The evolution of TME spans 10 years and anticipated the rise of the semantic web. According to IDC research analyst Sue Feldman, “most readers don't ask the right questions when searching for information,” and thus they don't find what they are looking for. TME adds layers of meaning to content, making it significantly more findable. TME first identifies nouns, verbs, subjects and objects. It extracts proper names, identifies the context of a piece, creates categories and creates smart interrelationships between entities. TME then places a rich layer of metadata in the content, making it far more findable, organized, optimized and ready for a variety of monetization initiatives. Reader’s Digest: Business Enabled TME makes content market-ready. It turns content archives into assets. Once content is semantically enriched, business opportunities are significantly broadened. Consider how 80-year-old Reader’s Digest leveraged text mining. The company wanted to create a joint venture with […]. In a feasibility test, a team of eight manually sifted through back issues, marking them with Post-Its®, searching for soft copy, then converting it into the proper format. After 2.5 weeks, the team had managed to accrue only 28 assets.
This was hardly a scalable solution and the opportunity was now at risk. Nstein implemented a centralized solution for Reader’s Digest that automatically ingested content from more than 60 magazines and 40 books, then semantically analyzed and enriched those pieces of content for easy search. Post installation, the team responsible for finding data was able to scale back to four and, in only one week, was able to pull more than 200 assets, all in proper format. This represented a more than 3500 percent productivity increase, but more importantly, Reader’s Digest was business enabled. What was an impossible task became possible, and what was unviable became opportunity.
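    The productivity figure can be checked with a short calculation. This is a sketch under one assumption that the profile does not state: that productivity is measured as assets found per person-week. The function name `assets_per_person_week` is illustrative only, and the "after" figure uses 200 as a lower bound because the text says "more than 200 assets".

```python
def assets_per_person_week(assets, people, weeks):
    """Productivity as assets found per person per week (an assumed metric)."""
    return assets / (people * weeks)


# Manual feasibility test: 8 people, 2.5 weeks, 28 assets.
before = assets_per_person_week(assets=28, people=8, weeks=2.5)

# After the centralized solution: 4 people, 1 week, at least 200 assets.
after = assets_per_person_week(assets=200, people=4, weeks=1)

ratio = after / before                 # how many times more productive
ratio_as_percent = ratio * 100         # new productivity as a % of the old
increase_percent = (ratio - 1) * 100   # strict percentage increase

print(f"before: {before:.1f} assets per person-week")
print(f"after:  {after:.1f} assets per person-week")
print(f"ratio:  {ratio:.1f}x ({ratio_as_percent:.0f}% of the old rate)")
print(f"strict increase: {increase_percent:.0f}%")
```

    Under this assumption the new rate is roughly 35.7 times the old one, which matches "more than 3500 percent" when the ratio itself is expressed as a percentage; read as a strict increase, the same numbers give a slightly lower figure of about 3,470 percent.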
  • 36. Text Analytics 2009: User Perspectives Solution Profile: SAP BusinessObjects SAP BusinessObjects offers a broad portfolio of tools and applications designed to help clients optimize business performance by connecting people, information, and businesses across business networks. The SAP BusinessObjects intelligence platform breaks the barriers of traditional business intelligence to ensure that all business users, enterprise-wide, have immediate access to reliable business information so they can do their jobs efficiently and effectively. SAP BusinessObjects information management (IM) solutions provide comprehensive information management functionality that can help you deliver integrated, accurate, and timely data, both structured and unstructured, across your enterprise. These powerful solutions can empower you to provide trusted data for key initiatives such as business transaction processing, business intelligence, data warehousing, data migration, and master data management. With software from SAP BusinessObjects you can leverage powerful data integration capabilities that enable you to:
    Access all types of structured and unstructured data from virtually any source, from databases to Web forums.
    Integrate and deliver data in real time or batch using flexible approaches through data federation or extraction, transformation, and loading (ETL).
    Improve data quality with the ability to profile, cleanse, and match data during the ETL process.
    Empower Your Business with Insights from Unstructured Text In a challenging economy, companies can’t afford to make costly mistakes in terms of strategy, product development, customer care, and operations. Too often in the pursuit of agility, companies make assumptions about what their customers truly want or where they can improve products and services and be more competitive.
However, correct decisions and effective strategy development require a complete and accurate understanding of your customers, your markets, and your business. In fact, in a 2008 Gartner and Forbes survey [11], 58% of C-level executives and business leaders indicated that exploiting information as a strategic asset is a top CIO priority in the next five years. Unfortunately, some of the most compelling and powerful information about evolving customer needs, product pain points, and recurring service issues is locked away and inaccessible to the decision makers who need it most. But how do you tap the right information when over 80% of all corporate data is locked away in unstructured text sources such as e-mail, documents, notes fields, and Web content? For example, as you seek to improve customer satisfaction, important information on customer frustrations, opinions, and feedback remains hidden in customer relationship management (CRM) comment fields, blog sites, survey notes, and e-mail. And for organizations working to ensure compliance with regulations, textual information that includes risk-related issues lies hidden in documents, records, and contracts. According to Gartner, unstructured data doubles every three months, and 7 million Web pages are added every day. Most companies don’t have the time, resources, or outsourcing budget to tackle this overwhelming amount of data using a heavy-weight, manual approach. The paradox is that businesses can’t afford to ignore this information either. Fortunately, there is a better way.
    Figure: SAP BusinessObjects Text Analysis integrates with your decision framework.
    [11] Raskino, Mark, and Lopez, Jorge. “The Gartner/Forbes Executive Survey, 2008,” Gartner, August 2008.
  • 37. Text Analytics 2009: User Perspectives SAP® BusinessObjects™ Text Analysis software processes, classifies, and summarizes vast amounts of text-based information, helping you gain better insight into your business so that you can empower those initiatives that directly improve your bottom line. With SAP BusinessObjects Text Analysis, you open a window to all the information you need to achieve a 360-degree view of your organization, your customers, your market, and your competitors. Entity-extraction functionality is the foundation of SAP BusinessObjects Text Analysis. Powerful extraction parses large volumes of documents, identifying “entities” such as customers, products, locations, and financial information relevant to your organization. Entity extraction is complemented by categorization capabilities that can apply company-specific or industry-specific taxonomies to the text for subject-level classification, and by summarization that creates readily understood abstracts. The software’s advanced linguistics capabilities read and “understand” documents in more than 30 major languages. Through sophisticated natural language processing, SAP BusinessObjects Text Analysis knows how verbs, nouns, and other language structures interact. In essence, it understands the meaning and context of information, not just the words themselves.
    Figure: SAP’s robust, multilanguage text-processing engine complements the company’s market-leading BI and analysis tools.
    SAP BusinessObjects customers in industries as varied as public sector, publishing, oil & gas, and high tech are using Text Analysis in applications for:
    Homeland security and law enforcement
    Automated content aggregation
    Online brand monitoring
    Customer service automation
    Content management and knowledge sharing
    Legal e-discovery
    Enterprise search
    Archiving and storage
    Data quality
    With SAP BusinessObjects Text Analysis, you can: Tap your customers’ opinions to improve business results.
Unlock critical insights hidden in online forums, call-center logs, CRM systems, and survey data, such as customers’ sentiment about your brand, products, and services.
    Power your BI initiative with information from all text sources. Complement structured BI with the wisdom buried in unstructured text sources by accessing and deriving meaning from hundreds of thousands of text documents, in a variety of file formats and in more than 30 major languages.
    Extract “what you didn’t know” from unstructured data. Deploy powerful extraction, categorization, and summarization of your free-form text information to quickly identify and understand the concepts, people, organizations, places, and other information that only exists in these sources.
    Integrate with your SAP BusinessObjects deployments. Incorporate valuable information from unstructured data sources into your organization’s decision-making framework. Add SAP BusinessObjects Text Analysis to your portfolio alongside your SAP NetWeaver® technology platform, SAP BusinessObjects Enterprise software deployment, or SAP BusinessObjects Data Integrator software environment.
  • 38. Text Analytics 2009: User Perspectives Solution Profile: TEMIS TEMIS is the leading provider of Text Analytics software solutions for the Enterprise. Its cutting-edge solution Luxid® addresses the needs of Life Sciences, Publishing, Enterprise, and Homeland Security industries. Its powerful information intelligence capabilities power strategic applications such as Scientific Discovery, Content Enrichment, Sentiment Analysis, and Competitive Intelligence by turning unstructured data into actionable knowledge, enabling advanced content analysis and strategic information discovery. Founded in 2000, TEMIS operates in the United States and Europe and is represented worldwide through its network of certified partners. TEMIS' innovative solutions have attracted leading organizations such as Bayer Schering Pharma, Ingenuity, Novartis, Sanofi-Aventis, Solvay Pharmaceuticals; BASF, BNP Paribas, PSA Peugeot-Citroën, Total; Agence France-Presse, Bertelsmann, CARMA International, Elsevier, Editions Lefebvre-Sarrut, Interone Worldwide (an agency of BBDO Worldwide), Nature Publishing Group, Springer Science+Business Media, The McGraw-Hill Companies, Thomson Reuters; Europol, French Ministry of Defense, French Ministry of Finance, Invest in France Agency; Convera, EMC. Luxid® Fundamentals Luxid® is a powerful and scalable solution giving immediate access to non-obvious information and delivering industry-specific knowledge from internal and external data sources. It brings long-awaited answers to the challenge of information discovery and knowledge extraction from unstructured data. Extract domain-specific knowledge: Multilingual annotators detect and extract high-value information from unstructured data by reliably identifying entities and semantic relationships relevant to your domain.
Sift through large corpora: To extract the set of most relevant documents in a corpus, users refine searches using self-adjusting filters relying on entities and semantic relationships. See trends and patterns: Thanks to a variety of dynamic charts, tables, cross-tabs, and reports that enable slicing and dicing and drilling up and down, users gain insight into the meaning behind the data. Connect the dots: Users can unveil information dependencies by intuitively exploring a network of entities linked by either their proximity or semantic relationships. Spark collaborative discoveries: Personalized, dynamic dashboards support organizational knowledge sharing.
    Key application areas Based on patented, award-winning technology and benefiting from a close collaboration with leading corporations, Luxid® is a powerful and scalable solution designed to be particularly relevant in the following four application areas:
    Scientific Discovery Innovations rarely come out of a vacuum: it has now become essential for Research and Development to be acutely aware of prior and ongoing work both internally and in competing teams across their industry. Given the massive amount of available scientific literature in proprietary and public content repositories, and in particular scientific articles and patents, Luxid® for Scientific Discovery has become an essential productivity tool to efficiently analyze this content and extract non-obvious information and key insights to drive R&D’s agenda. Luxid® for Scientific Discovery is the platform of choice for optimizing research effectiveness by shortening the innovation cycle, avoiding costly dead-ends, pointing toward virgin or underexploited areas of interest, and minimizing Intellectual Property infringement risks and associated legal costs.
  • 39. Text Analytics 2009: User Perspectives Content Enrichment The emergence of the Internet has opened the door to new ways of accessing media, new access devices, new expectations and social practices, and new competitors, which represent a daunting challenge to the traditional publishing business model. This also brings to the table an unprecedented range of new content monetization concepts, formats, and channels that represent an opportunity to re-invent the business. A case in point is the targeting of the Long Tail, a previously inaccessible set of increasingly smaller customer segments from micro-communities all the way to the audience-of-one. To take advantage of these opportunities, publishers rely on Luxid® for Content Enrichment to annotate their massive amounts of previously static and unstructured content with domain-relevant metadata. Once content is annotated, it becomes possible to efficiently navigate it, selectively extract the documents or document sections which are precisely relevant to a given topic, group documents in clusters, and further enrich them by establishing links with other documents or information sources. By easing the repurposing of content into a virtually limitless range of custom formats and enabling unprecedented time-to-market, narrower focus and increased relevance, Luxid® for Content Enrichment also enhances content stickiness, enabling higher audience retention and involvement, recognized as the key to Publishers’ future revenue growth.
    Sentiment Analysis Consumers are using the Internet with increasing intensity and are leading a massive, global conversation about products, services, brands, and companies. They use online social media, phone calls, e-mail, chats, and text messages to discuss what, when, and how they buy. Consumers, not marketers, now lead the discussion. By listening in on these conversations, Luxid® for Sentiment Analysis identifies the reaction of the public to products, brands, events, and policies.
It also helps identify key influencers and locate relevant media for a given topic. These strategic insights help organizations optimize their R&D investment, develop more relevant products, and gain market share, while reducing product return rates and detecting epidemic quality defects earlier.
Competitive Intelligence
Today, organizations must deal effectively with information overload. Information is a raw material that they must collect, sort, and understand in order to optimize their processes of anticipation, innovation, and decision-making. Based on TEMIS technology and resulting from close collaboration with leading customers, Luxid® for Competitive Intelligence automates the analysis of information flows by extracting Competitive Intelligence topics to power your financial, market, technology, and product watch strategies.
Luxid® Architecture
To scale up to the diverse needs of organizations and ease deployment, the core architecture of Luxid® has been structured into three stackable software layers:
Luxid® Annotation Factory annotates documents of any format with extracted entities, relations, categories, and topics. Built as a robust and scalable platform, its deep understanding of all major languages powers the ability to reliably identify high-value information across domains and geographies.
Luxid® Information Mart federates heterogeneous sources and enriches the harvested documents, leveraging Luxid® Annotation Factory, in order to build the knowledge base that powers information discovery.
Luxid® Information Analytics enables discovery through a web-based, feature-rich portal designed to search, analyze, discover, and share underlying knowledge with information consumers.
A key aspect of Luxid® is that its annotation capabilities can be further customized and extended to specific industries or domains by developing custom Skill Cartridges™ that model the corresponding Entities and Relationships.
This gives the solution virtually universal applicability.
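To make the three-layer idea concrete, the following is a minimal, purely illustrative Python sketch of an annotate, federate, and discover pipeline. It is not Luxid® code and does not use any TEMIS API: the function names (`annotate`, `build_mart`, `find_docs`), the `TOY_CARTRIDGE` pattern table, and the sample company and product names are all assumptions invented for this example, and simple regular expressions stand in for real linguistic analysis.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical stand-in for a domain cartridge: entity type -> regex pattern.
# Real extraction would rely on linguistic analysis, not keyword patterns.
TOY_CARTRIDGE = {
    "Company": re.compile(r"\b(Acme|Globex)\b"),
    "Product": re.compile(r"\b(Widget|Gadget)\b"),
}


@dataclass
class AnnotatedDoc:
    doc_id: str
    text: str
    # entity type -> sorted list of unique mentions found in the text
    entities: dict = field(default_factory=dict)


def annotate(doc_id, text, cartridge=TOY_CARTRIDGE):
    """Layer 1 (annotation): tag a document with the entities the cartridge finds."""
    entities = {}
    for etype, pattern in cartridge.items():
        found = sorted(set(pattern.findall(text)))
        if found:
            entities[etype] = found
    return AnnotatedDoc(doc_id, text, entities)


def build_mart(sources):
    """Layer 2 (information mart): merge sources and index doc ids by entity."""
    docs = {}
    index = defaultdict(set)  # (entity type, entity name) -> set of doc ids
    for source in sources:
        for doc_id, text in source.items():
            doc = annotate(doc_id, text)
            docs[doc_id] = doc
            for etype, names in doc.entities.items():
                for name in names:
                    index[(etype, name)].add(doc_id)
    return docs, index


def find_docs(index, **filters):
    """Layer 3 (analytics): ids of documents that match every entity filter."""
    hits = [index.get((etype, name), set()) for etype, name in filters.items()]
    return sorted(set.intersection(*hits)) if hits else []


# Two made-up sources standing in for heterogeneous repositories.
newswire = {"n1": "Acme launches the Widget.", "n2": "Globex recalls the Gadget."}
patents = {"p1": "Acme files a patent covering the Gadget."}

docs, index = build_mart([newswire, patents])
print(find_docs(index, Company="Acme"))                    # ['n1', 'p1']
print(find_docs(index, Company="Acme", Product="Gadget"))  # ['p1']
```

In this sketch, swapping in a different pattern table plays the role that a custom cartridge plays in the description above: the indexing and query layers stay unchanged while the entity model varies by domain.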