An Introduction to Text Analytics: 2013 Workshop presentation

Workshop presented by Seth Grimes at the 2013 Text Analytics Summit.

Published in: Technology, Education
1 Comment
  • Great presentation. It could perhaps be extended with the following three items, though I may be biased, since these are the fields in which I've largely been working myself:
    - Desktop tool: KNIME (www.knime.org)
    - When talking about RDF, also mention NLP2RDF and OPENLINKEDDATA
    - The life-science applications of text analytics are very well developed and are entering both social media (pharmacovigilance) and the mining of electronic health records (patient recruitment, IMI-EHR4CR)
    Again, great talk; a pity that one cannot download it :-)

  1. An Introduction to Text Analytics: Social, Online, and Enterprise
     Seth Grimes, Alta Plana Corporation, @sethgrimes
     Text & Social Analytics Summit 2013 Workshop, June 4, 2013
  2. Perspectives
     (Text Analytics Introduction, 2013 Text Analytics Summit)
     Perspective #1: You support research or work in IT. You help end users who have lots of text.
     Perspective #2: You’re a researcher, business analyst, or other “end user.” You have lots of text. You want an automated way to deal with it.
     Perspective #3: You work for a solution provider.
     Perspective #4: Other?
  3. From the Analytics/Business Perspective
     1. If you are not analyzing text – if you're analyzing only transactional information – you're missing opportunity or incurring risk.
     2. Text analytics can boost business results –
        “Organizations embracing text analytics all report having an epiphany moment when they suddenly knew more than before.”
        -- Philip Russom, the Data Warehousing Institute, 2007
        http://tdwi.org/articles/2007/05/09-what-works/bi-search-and-text-analytics.aspx
        – via established BI / data-mining programs, or independently.
  4. Agenda
     1. The “Unstructured” Data Challenge.
     2. Text analytics for information retrieval and BI.
     3. Text analysis technologies and processes.
     4. Applications.
     5. The market: Software, services, and solutions.
     Note: I will not cover the agenda in a linear fashion. Product images and references are used to illustrate only. Some slides are included for reference purposes.
  5. Value in Text
     It’s a truism that 80% of enterprise-relevant information originates in “unstructured” form:
     - E-mail and messages.
     - Web pages, news & blog articles, forum postings, and other social media.
     - Contact-center notes and transcripts.
     - Surveys, feedback forms, warranty claims.
     - Scientific literature, books, legal documents.
     - ...
     http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/
     Non-text “unstructured” content?
     - Images
     - Audio including speech
     - Video
     Value derives from patterns.
  6. Unstructured Sources
     These sources may contain “traditional” data.
       The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85.
     And they may not.
       www.stanford.edu/%7ernusse/wntwindow.html
       Axin and Frat1 interact with dvl and GSK, bridging Dvl to GSK in Wnt-mediated regulation of LEF-1. Wnt proteins transduce their signals through dishevelled (Dvl) proteins to inhibit glycogen synthase kinase 3beta (GSK), leading to the accumulation of cytosolic beta-catenin and activation of TCF/LEF-1 transcription factors. To understand the mechanism by which Dvl acts through GSK to regulate LEF-1, we investigated the roles of Axin and Frat1 in Wnt-mediated activation of LEF-1 in mammalian cells. We found that Dvl interacts with Axin and with Frat1, both of which interact with GSK. Similarly, the Frat1 homolog GBP binds Xenopus Dishevelled in an interaction that requires GSK.
       www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=10428961&dopt=Abstract
  7. Structure in “unstructured” sources
     http://www.tropicalisland.de/NYC_New_York_Brooklyn_Bridge_from_World_Trade_Center_b.jpg
     $x(t) = t, \quad y(t) = \tfrac{1}{2} a \left(e^{t/a} + e^{-t/a}\right) = a \cosh(t/a)$
     http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg
  8. Business Intelligence
     Conventional BI feeds off:
     It runs off:
       "SUMLEV","STATE","COUNTY","STNAME","CTYNAME","YEAR","POPESTIMATE",
       50,19,1,"Iowa","Adair County",1,8243,4036,4207,446,225,221,994,509
       50,19,1,"Iowa","Adair County",2,8243,4036,4207,446,225,221,994,509
       50,19,1,"Iowa","Adair County",3,8212,4020,4192,442,222,220,987,505
       50,19,1,"Iowa","Adair County",4,8095,3967,4128,432,208,224,935,488
       50,19,1,"Iowa","Adair County",5,8003,3924,4079,405,186,219,928,495
       50,19,1,"Iowa","Adair County",6,7961,3892,4069,384,183,201,907,472
       50,19,1,"Iowa","Adair County",7,7875,3855,4020,366,179,187,871,454
       50,19,1,"Iowa","Adair County",8,7795,3817,3978,343,162,181,841,439
       50,19,1,"Iowa","Adair County",9,7714,3777,3937,338,159,179,805,417
     “The bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.”
     -- Prabhakar Raghavan, Google, formerly Yahoo Research
  9. Business Intelligence
     Conventional BI produces:
  10. Text-BI: Back to the Future
     Business intelligence (BI) was first defined in 1958:
     “In this paper, business is a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera... The notion of intelligence is also defined here... as ‘the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.’”
     -- Hans Peter Luhn, “A Business Intelligence System,” IBM Journal, October 1958
     What was IT like in the ‘50s?
  11. Document input and processing
     Knowledge handling is key
     Desk Set (1957): Computer engineer Richard Sumner (Spencer Tracy) and television network librarian Bunny Watson (Katharine Hepburn) and the "electronic brain" EMERAC.
  12. From the Information Retrieval Perspective
     What do people do with electronic documents?
     1. Publish, Manage, and Archive.
     2. Index and Search.
     3. Categorize and Classify according to metadata & contents.
     4. Information Extraction.
     For textual documents, text analytics enhances #1 & #2 and enables #3 & #4.
     You need linguistics to do #1 & #4 well, to deal with meaning (a.k.a. semantics).
     Search is not enough...
  13. Semantics
     Text analytics adds semantic understanding of –
     - Named entities: people, companies, places, etc.
     - Pattern-based entities: e-mail addresses, phone numbers, etc.
     - Concepts: abstractions of entities.
     - Facts and relationships.
     - Concrete and abstract attributes (e.g., 10-year, expensive, comfortable).
     - Subjectivity in the forms of opinions, sentiments, and emotions: attitudinal data.
     Call these elements, collectively, features.
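As an illustration of the "pattern-based entities" item above, here is a minimal sketch using regular expressions. The two patterns and the sample sentence are assumptions made for illustration; real e-mail and phone formats are far more varied, and production tools typically use richer rules.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not a standard).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")


def extract_pattern_entities(text):
    """Return pattern-based entities found in text, grouped by type."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
    }


# Hypothetical sample text.
sample = "Contact jane.doe@example.com or call 301-555-0142 for details."
entities = extract_pattern_entities(sample)
print(entities)
```

Named entities (people, companies, places) generally cannot be captured this way; they need lexicons, linguistic rules, or trained models.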
  14. Semantics, Analytics, and IR
     Text analytics generates semantics to bridge search, BI, and applications, enabling next-generation information systems.
     Diagram labels (Search, BI, Applications):
     - Text analytics (inner circle)
     - Semantic search (search + text)
     - Integrated analytics (text + BI)
     - Search based applications (search + text + apps)
     - Information access (search + text + BI)
     - NextGen CRM, EFM, MR, marketing, …
  15. Information Access
     Text analytics transforms Information Retrieval (IR) into Information Access (IA).
     • Search terms become queries.
     • Retrieved material is mined for larger-scale structure.
     • Retrieved material is mined for features such as entities and topics or themes.
     • Retrieved material is mined for smaller-scale structure such as facts and relationships.
     • Results are presented intelligently, for instance, grouping on mined topics-themes.
     • Extracted information may be visualized and explored.
  16. Information Access
     Text analytics enables results that suit the information and the user, e.g., answers –
  17. Text Data Mining Enables Content Exploration
     Decisive Analytics
     http://www.dac.us/
  18. Text Analytics Definition
     Text analytics automates what researchers, writers, scholars, and all the rest of us have been doing for years.
     Text analytics –
     - Applies linguistic and/or statistical techniques to extract concepts and patterns that can be applied to categorize and classify documents, audio, video, images.
     - Transforms “unstructured” information into data for application of traditional analysis techniques.
     - Discerns meaning and relationships in large volumes of information that were previously unprocessable by computer.
  19. Glossary: Information in Text
     Entity: Typically a name (person, place, organization, etc.) or a patterned composite (phone number, e-mail address).
     Concept: An abstract entity or collection of entities.
     Feature: An element of interest, e.g., an entity, concept, topic, event, fact, relationship, etc.
     Metadata: Descriptive information such as author, title, publication date, file type and size.
     Information Extraction (IE) involves pulling metadata, features & their attributes out of textual sources.
     Co-reference: Multiple expressions that describe the same thing. Anaphora including pronoun use is an example:
       John pushed Max. He fell.
       John pushed Max. He laughed.
       -- Laure Vieu and Patrick Saint-Dizier
  20. Glossary: Methods
     Natural Language Processing (NLP): Computers hear humans.
     Parsing: Evaluating the content of a document or text.
     Tokenization: Identification of distinct elements, e.g., words, punctuation marks, n-grams.
     Stemming/Lemmatization: Reducing word variants (conjugation, declension, case, pluralization) to bases.
     Term reduction: Use of synonyms, taxonomy, similarity measures to group like terms.
     Tagging: Wrapping XML tags around distinct features, a.k.a. text augmentation. May involve text enrichment.
     POS Tagging: Specifically identifying parts of speech via syntactic analysis.
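To make tokenization, n-grams, and stemming from the glossary above concrete, here is a small standard-library sketch. The function names are hypothetical, and `crude_stem` is only a toy suffix-stripper standing in for a real stemmer or lemmatizer.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def crude_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = tokenize("Time flies like an arrow.")
print(tokens)
print(ngrams(tokens, 2))
print([crude_stem(t) for t in tokens])
```

A crude stemmer like this conflates and mangles words freely; it is here only to show where the step sits in a pipeline.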
  21. Glossary: Organizing and Structuring
     Categorization: Specification of feature/doc groupings.
     Clustering: Creating categories according to outcome-similarity criteria.
     Taxonomy: An exhaustive, hierarchical categorization of entities and concepts, either specified or generated by clustering.
     Classification: Assigning an item to a category, perhaps using a taxonomy.
     Ontology: In practice, a classification of a set of items in a way that represents knowledge.
       An oak is a tree. A rose is a flower. A deer is an animal. A sparrow is a bird. Russia is our fatherland. Death is inevitable.
       -- P. Smirnovskii, A Textbook of Russian Grammar
     Semantics relates features to others.
  22. Glossary: Evaluation
     Precision: The proportion of positive decisions (e.g., items assigned to a class) that are correct.
     Recall: The proportion of all actually correct items that were identified.
       Find the even numbers:
       9 17 12 4 1 6 2 20 7 3 8 10
       What is my Precision? What is my Recall?
     Accuracy: How well an IE or IR task has been performed, commonly summarized as an F-score, the harmonic mean of Precision and Recall:
       f = 2 * (precision * recall) / (precision + recall)
     Relevance: Do results match the individual user’s needs?
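The "find the even numbers" exercise above can be worked through in a few lines. The transcript does not show which numbers were picked on the slide, so the `selected` list below is a hypothetical answer chosen for illustration.

```python
def precision_recall_f1(selected, relevant):
    """Compute precision, recall, and the balanced F-score for a selection."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


numbers = [9, 17, 12, 4, 1, 6, 2, 20, 7, 3, 8, 10]
relevant = [n for n in numbers if n % 2 == 0]   # the seven even numbers
selected = [12, 4, 6, 9, 20]                     # hypothetical picks: four right, one wrong

precision, recall, f1 = precision_recall_f1(selected, relevant)
print(f"precision={precision:.3f} recall={recall:.3f} f={f1:.3f}")
```

With these picks, four of the five selections are even (precision 4/5) and four of the seven even numbers were found (recall 4/7), giving an F-score of 2/3.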
  23. Text Analytics Pipeline
     Typical steps in text analytics include –
     1. Identify and retrieve documents for analysis.
     2. Apply statistical and/or linguistic and/or structural techniques to discern, tag, and extract entities, concepts, relationships, and events (features) within document sets.
     3. Apply statistical pattern-matching & similarity techniques to classify documents and organize extracted features according to a specified or generated categorization / taxonomy.
     – via a pipeline of statistical & linguistic steps.
     Let’s look at them, at steps to model text...
  24. Modelling Text
     Metadata.
     - E.g., title, author, date.
     Statistics.
     - E.g., term frequency, co-occurrence, proximity.
     Linguistics.
     - Lexicons, gazetteers, phrase books.
     - Word morphology, parts of speech, syntactic rules.
     - Semantic networks.
     - Larger-scale structure including discourse.
     Machine learning.
  25. “Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the auto-abstract.”
     -- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
     http://wordle.net
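The idea in the Luhn quotation (score words by frequency, then score sentences by the words they contain) can be sketched roughly as follows. This is a simplified illustration, not Luhn's actual procedure; the stopword list, the sentence splitter, and the sample text are all assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "are", "by"}


def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def auto_abstract(text, num_sentences=1):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))            # word significance = frequency

    def score(sentence):                           # sentence significance
        return sum(freq[w] for w in content_words(sentence))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]


sample = (
    "Text analytics extracts features from text. "
    "The weather was pleasant. "
    "Analytics turns text into data."
)
print(auto_abstract(sample, 1))
```

Summing word frequencies favors long sentences; a more careful version would normalize by sentence length or look for clusters of significant words.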
  26. Modelling Text
     The text content of a document can be considered an unordered “bag of words.”
     Particular documents are points in a high-dimensional vector space.
     Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.
  27. Modelling Text
     We might construct a document-term matrix...
       D1 = “I like databases”
       D2 = “I hate hate databases”

            I   like   hate   databases
       D1   1   1      0      1
       D2   1   0      2      1

     and use a weighting such as TF-IDF (term frequency–inverse document frequency)… in computing the cosine of the angle between weighted doc-vectors to determine similarity.
     http://en.wikipedia.org/wiki/Term-document_matrix
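The two-document example above is small enough to compute by hand or in a few lines of code. The sketch below builds the same document-term matrix and compares cosine similarity on raw counts and on TF-IDF weights. The smoothed IDF formula used here is one common variant, chosen as an assumption; the slide does not specify a formula.

```python
import math
from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}
vocab = ["I", "like", "hate", "databases"]        # column order as on the slide


def term_counts(text):
    """Row of the document-term matrix for one document."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]


matrix = {name: term_counts(text) for name, text in docs.items()}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tfidf(matrix):
    """Weight counts by a smoothed IDF: log((1 + N) / (1 + df)) + 1 (assumed variant)."""
    n_docs = len(matrix)
    df = [sum(1 for row in matrix.values() if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return {name: [tf * w for tf, w in zip(row, idf)] for name, row in matrix.items()}


raw_similarity = cosine(matrix["D1"], matrix["D2"])
weighted = tfidf(matrix)
weighted_similarity = cosine(weighted["D1"], weighted["D2"])
print(matrix)
print(round(raw_similarity, 4), round(weighted_similarity, 4))
```

TF-IDF raises the weight of "like" and "hate", the terms that distinguish the two documents, so the weighted similarity comes out lower than the raw-count similarity.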
  28. Modelling Text
     Analytical methods make text tractable.
     Latent semantic indexing utilizing singular value decomposition for term reduction / feature selection.
     - Creates a new, reduced concept space.
     - Takes care of synonymy, polysemy, stemming, etc.
     Classification technologies / methods:
     - Naive Bayes.
     - Support Vector Machine.
     - K-nearest neighbor.
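Of the classification methods listed above, Naive Bayes is the simplest to sketch. Below is a minimal multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The class name and the four training snippets are hypothetical, and a real application would use far more data and an established library.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)          # log prior
            for word in text.lower().split():
                if word in self.vocab:                     # ignore unseen words
                    count = self.word_counts[label][word] + 1
                    score += math.log(count / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training data.
train_texts = [
    "great product love it",
    "love the fast service",
    "terrible support hate it",
    "hate the slow service",
]
train_labels = ["pos", "pos", "neg", "neg"]

classifier = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(classifier.predict("love the product"))
print(classifier.predict("hate slow support"))
```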
  29. Modelling Text
     In the form of query-document similarity, this is Information Retrieval 101.
     See, for instance, Salton & Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” 1988.
     A useful basic tech paper: Russ Albright, SAS, “Taming Text with the SVD,” 2004.
     Given the complexity of human language, statistical models may fall short.
     “Reading from text in general is a hard problem, because it involves all of common sense knowledge.”
     -- Expert systems pioneer Edward A. Feigenbaum
  30. “Tri-grams” here are pretty good at describing the Whatness of the source text. Yet...
     “This rather unsophisticated argument on ‘significance’ avoids such linguistic implications as grammar and syntax... No attention is paid to the logical and semantic relationships the author has established.”
     -- Hans Peter Luhn, 1958
  31. New York Times, September 8, 1957
     Anaphora / coreference: “They”
  32. Advanced Term Counting
     Counting term hits, in one source, at the doc level, doesn’t take you far...
     Good or bad? What’s behind the posts?
  33. Why Do We Need Linguistics?
     To get more out of text than can be delivered by a bag/vector of words and term counting.
       The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index gained 1.44, or 0.11 percent, to 1,263.85.
       The Dow gained 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85.
       -- Luca Scagliarini, Expert System
       Time flies like an arrow. Fruit flies like a banana.
       -- Groucho Marx
     (Statistical co-occurrence to build a model for analysis of text such as these is possible but still limited.)
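The two Dow sentences above make the point directly: they say opposite things, yet as bags of words they are identical. A short check, using plain whitespace tokenization as a simplifying assumption:

```python
from collections import Counter


def bag_of_words(text):
    """Unordered multiset of lowercased, whitespace-delimited tokens."""
    return Counter(text.lower().split())


sentence_a = (
    "The Dow fell 46.58, or 0.42 percent, to 11,002.14. "
    "The Standard & Poor's 500 index gained 1.44, or 0.11 percent, to 1,263.85."
)
sentence_b = (
    "The Dow gained 46.58, or 0.42 percent, to 11,002.14. "
    "The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85."
)

same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)
print("Texts differ:", sentence_a != sentence_b)
print("Bags of words identical:", same_bag)
```

Any model that sees only term counts cannot tell which index rose and which fell; word order and syntax carry that information.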
  34. Parts of Speech
  35. Parts of Speech
  36. Parts of Speech
  37. From POS to Relationships
     When we understand, for instance, parts of speech (POS), e.g. –
       <subject> <verb> <object>
     – we’re in a position to discern facts and relationships...
     Semantic networks such as WordNet are an asset for word-sense disambiguation.
  38. Tagging with GATE
     Annotation (tagging) in action via GATE, an open-source tool:
  39. GATE Language Processing Pipeline
  40. GATE Text Annotation
  41. GATE Exported XML
     <?xml version="1.0" encoding="windows-1252"?>
     <GateDocument>
     <!-- The document's features -->
     <GateDocumentFeatures>
       <Feature>
         <Name className="java.lang.String">MimeType</Name>
         <Value className="java.lang.String">text/html</Value>
       </Feature>
       <Feature>
         <Name className="java.lang.String">gate.SourceURL</Name>
         <Value className="java.lang.String">http://altaplana.com/SentimentAnalysis.html</Value>
       </Feature>
     </GateDocumentFeatures>
     <!-- The document content area with serialized nodes -->
     <TextWithNodes><Node id="0" />Sentiment<Node id="9" /> <Node id="10" />Analysis<Node id="18"/>:<Node id="19" /> <Node id="20" />A<Node id="21" /> <Node id="22" />Focus<Node id="27" /> <Node id="28" />on<Node id="30" /> <Node id="31" />Applications<Node id="43" /><Node id="44" /><Node id="45" />by<Node id="47" /> <Node id="48" />Seth<Node id="52" /> <Node id="53"/>Grimes<Node id="59" /> [material cut]</TextWithNodes>
     <!-- The default annotation set -->
     <AnnotationSet> [material cut]
       <Annotation Id="67" Type="Token" StartNode="48" EndNode="52">
         <Feature>
           <Name className="java.lang.String">length</Name>
           <Value className="java.lang.String">4</Value>
         </Feature>
         <Feature>
           <Name className="java.lang.String">category</Name>
           <Value className="java.lang.String">NNP</Value>
         </Feature>
         <Feature>
           <Name className="java.lang.String">orth</Name>
           <Value className="java.lang.String">upperInitial</Value>
         </Feature>
         <Feature>
           <Name className="java.lang.String">kind</Name>
           <Value className="java.lang.String">word</Value>
         </Feature>
         <Feature>
           <Name className="java.lang.String">string</Name>
           <Value className="java.lang.String">Seth</Value>
         </Feature>
       </Annotation> [material cut]
     </AnnotationSet>
     </GateDocument>
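Stand-off annotations like the ones in this export are straightforward to read back with a standard XML parser. The sketch below parses a trimmed, well-formed fragment modeled on the slide (two of the five features, and no "[material cut]" markers) with Python's `xml.etree.ElementTree`. The function name and output layout are assumptions; this is not GATE's own API.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment modeled on the slide's export.
GATE_SNIPPET = """
<GateDocument>
  <AnnotationSet>
    <Annotation Id="67" Type="Token" StartNode="48" EndNode="52">
      <Feature>
        <Name className="java.lang.String">category</Name>
        <Value className="java.lang.String">NNP</Value>
      </Feature>
      <Feature>
        <Name className="java.lang.String">string</Name>
        <Value className="java.lang.String">Seth</Value>
      </Feature>
    </Annotation>
  </AnnotationSet>
</GateDocument>
"""


def read_annotations(xml_text):
    """Return one dict per <Annotation>, with its features as a name-to-value map."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for ann in root.iter("Annotation"):
        features = {f.findtext("Name"): f.findtext("Value") for f in ann.findall("Feature")}
        rows.append({
            "id": int(ann.get("Id")),
            "type": ann.get("Type"),
            "start": int(ann.get("StartNode")),
            "end": int(ann.get("EndNode")),
            "features": features,
        })
    return rows


annotations = read_annotations(GATE_SNIPPET)
print(annotations)
```

The start and end node ids (48 and 52) are character offsets into the document text, which is what lets annotations be stored apart from the content they describe.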
  42. Information Extraction
     For content analysis, key in on extracting information.
     Text features are typically marked up (annotated) in-place with XML.
     Entities and concepts may correspond to dimensions in a standard BI model.
     - Both classes of object are hierarchically organized and have attributes.
     - We can have both discovered and predetermined classifications (taxonomies) of text features.
     Dimensional modelling facilitates extraction to databases...
  43. Database Insert
     Illustrated via an IBM example:
     “The standard features are stored in the STANDARD_KW table, keywords with their occurrences in the KEYWORD_KW_OCC table, and the text list features in the TEXTLIST_TEXT table. Every feature table contains the DOC_ID as a reference to the DOCUMENT table.”
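The layout in the IBM quotation (feature tables keyed by DOC_ID back to a DOCUMENT table) can be sketched with an in-memory SQLite database. The table names DOCUMENT and KEYWORD_KW_OCC come from the quotation; every column other than DOC_ID is an assumption made for illustration, as are the sample rows, which reuse the offsets from the GATE export slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOCUMENT (
    DOC_ID     INTEGER PRIMARY KEY,
    SOURCE_URL TEXT                                   -- assumed column
);
CREATE TABLE KEYWORD_KW_OCC (
    DOC_ID       INTEGER REFERENCES DOCUMENT(DOC_ID), -- link back to the document
    KEYWORD      TEXT,                                -- assumed column
    START_OFFSET INTEGER,                             -- assumed column
    END_OFFSET   INTEGER                              -- assumed column
);
""")

url = "http://altaplana.com/SentimentAnalysis.html"
conn.execute("INSERT INTO DOCUMENT VALUES (?, ?)", (1, url))
conn.executemany(
    "INSERT INTO KEYWORD_KW_OCC VALUES (?, ?, ?, ?)",
    [(1, "Seth", 48, 52), (1, "Grimes", 53, 59)],
)

# Once features are in tables, ordinary SQL (and BI tooling) can query them.
rows = conn.execute("""
    SELECT d.SOURCE_URL, k.KEYWORD
    FROM KEYWORD_KW_OCC AS k
    JOIN DOCUMENT AS d ON d.DOC_ID = k.DOC_ID
    ORDER BY k.START_OFFSET
""").fetchall()
print(rows)
```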
  44. Sophisticated Pattern Matching
     Lexicons and language rules boost accuracy. An example – a bit complicated – GATE Extension via JAPE Rules...
       /* locationcontext2.jape
       * Dhavalkumar Thakker, Nottingham Trent University/PA Photos 15 Sept 2008
       */
       Phase: locationcontext2
       Input: Lookup Token
       Options: control = all debug = false
       //Manchester, UK
       Rule: locationcontext2
       Priority:50
       ({Token.string == "at"})
       ( ({Token.string =~ "[Tt]he"})?
       (({Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
       ({Token.kind == punctuation})?
       {Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
       ({Token.kind == punctuation})?
       {Token.kind == word, Token.category == NNP, Token.orth == upperInitial} ) |
       ( {Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
       ({Token.kind == punctuation})?
       ( {Token.kind == word, Token.category == NNP, Token.orth == allCaps} |
       {Token.kind == word, Token.category == NNP, Token.orth == upperInitial} )) |
       ...
  45. Predictive Modeling
     Another processing pipeline and more rules…
  46. Predictive Modeling
     In the text context, predictive analytics is mostly about classification and automated processing.
     Modeling also helps, operationally, with:
     - Completion
     - Disambiguation: use dictionaries, context
     - Error correction
     http://en.wikipedia.org/wiki/File:ITap_on_Motorola_C350.jpg
  47. Error Correction
     “Search logs suggest that from 10-15% of queries contain spelling or typographical errors. Fittingly, one important query reformulation tool is spelling suggestions or corrections.”
     -- Marti Hearst, Search User Interfaces
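A very rough sketch of a spelling suggester, using `difflib.get_close_matches` from the Python standard library. The vocabulary, the similarity cutoff, and the function name are assumptions; search engines typically draw on query logs and language models rather than a fixed word list.

```python
import difflib

# Hypothetical vocabulary of known terms.
VOCABULARY = ["analytics", "sentiment", "semantic", "taxonomy", "ontology"]


def suggest(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the term itself if known, else the closest known term, else None."""
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest("sentimant"))
print(suggest("taxonomy"))
print(suggest("zzzz"))
```

The cutoff trades missed corrections against false ones: a lower value suggests more aggressively, a higher value stays silent more often.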
  48. Accuracy and Semi-Structured Sources
     An e-mail message is “semi-structured,” which facilitates extracting metadata --
       Date: Sun, 13 Mar 2005 19:58:39 -0500
       From: Adam L. Buchsbaum <alb@research.att.com>
       To: Seth Grimes <grimes@altaplana.com>
       Subject: Re: Papers on analysis on streaming data

       seth, you should contact divesh srivastava, divesh@research.att.com regarding at&t labs data streaming technology.
       Adam
     “Reading from text in structured domains I don’t think is as hard.”
     -- Edward A. Feigenbaum
     Surveys are also typically semi-structured, in a different way...
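Because the header block of an e-mail follows a fixed grammar, the metadata can be pulled out with a standard parser and no linguistics at all. A minimal sketch with Python's `email` package, using the message from the slide with the body abbreviated; the function name and the output keys are assumptions.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Headers as shown on the slide; the body is abbreviated here.
RAW_MESSAGE = (
    "Date: Sun, 13 Mar 2005 19:58:39 -0500\n"
    "From: Adam L. Buchsbaum <alb@research.att.com>\n"
    "To: Seth Grimes <grimes@altaplana.com>\n"
    "Subject: Re: Papers on analysis on streaming data\n"
    "\n"
    "seth, you should contact divesh srivastava regarding at&t labs data streaming technology.\n"
)


def extract_metadata(raw):
    """Parse the structured header block; the free-text body is returned as-is."""
    msg = message_from_string(raw)
    return {
        "date": parsedate_to_datetime(msg["Date"]),
        "from_address": parseaddr(msg["From"])[1],
        "to_address": parseaddr(msg["To"])[1],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


metadata = extract_metadata(RAW_MESSAGE)
print(metadata["date"].isoformat(), metadata["from_address"], metadata["subject"])
```

The body is where text analytics comes back in: the person and organization mentioned there are not marked up in any header.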
  49. Structured & ‘Unstructured’
     The respondent is invited to explain his/her attitude:
  50. Structured & ‘Unstructured’
     We typically look at frequencies and distributions of coded-response questions:
     Linkage of responses to coded ratings helps in analyses of free text:
  51. Sentiment Analysis
     “Sentiment analysis is the task of identifying positive and negative opinions, emotions, and evaluations.”
     -- Wilson, Wiebe & Hoffmann, 2005, “Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis”
     “Sentiment analysis or opinion mining is the computational study of opinions, sentiments and emotions expressed in text… An opinion on a feature f is a positive or negative view, attitude, emotion or appraisal on f from an opinion holder.”
     -- Bing Liu, 2010, “Sentiment Analysis and Subjectivity,” in Handbook of Natural Language Processing
     “Dell really... REALLY need to stop overcharging... and when i say overcharing... i mean atleast double what you would pay to pick up the ram yourself.”
     -- From Dell’s IdeaStorm.com
  52. Sentiment Analysis
     - What are people saying? What’s hot/trending?
     - What are they saying about {topic|person|product} X?
     - ... about X versus {topic|person|product} Y?
     - How has opinion about X and Y evolved?
     - How has opinion correlated with {our|competitors’|general} {news|marketing|sales|events}?
     - What’s behind opinion, the root causes?
     - (How) Can we link opinions & transactions?
     - (How) Can we link opinion & intent?
     - Who are opinion leaders?
     - How does sentiment propagate across channels?
  53. Sentiment Analysis
     Applications include:
     - Brand / Reputation Management.
     - Competitive intelligence.
     - Customer Experience Management.
     - Enterprise Feedback Management.
     - Quality improvement.
     - Trend spotting.
     There are significant challenges…
  54. Complications Illustrated
     - Unfiltered duplicates
     - External reference
     - “Kind” = type, variety, not a sentiment.
     - Complete misclassification
  55. Sentiment Complications
     There are many complications.
     Sentiment may be of interest at multiple levels.
     - Corpus / data space, i.e., across multiple sources.
     - Document.
     - Statement / sentence.
     - Entity / topic / concept.
     Human language is noisy and chaotic!
     - Jargon, slang, irony, ambiguity, anaphora, polysemy, synonymy, etc.
     Context is key. Discourse analysis comes into play.
     Must distinguish the sentiment holder from the object:
       “Geithner said the recession may worsen.”
  56. Beyond Polarity
  57. Intent Analysis
     http://www.aiaioo.com/whitepapers/intention_analysis_use_cases.pdf
     http://sentibet.com/
  58. Applications
     Text analytics has applications in –
     • Intelligence & law enforcement.
     • Life sciences.
     • Media & publishing including social-media analysis and contextual advertising.
     • Competitive intelligence.
     • Voice of the Customer: CRM, product management & marketing.
     • Legal, tax & regulatory (LTR) including compliance.
     • Recruiting.
  59. Online Commerce
     Text analytics is applied for marketing, search optimization, competitive intelligence.
     Analyze social media and enterprise feedback to understand the Voice of the Market:
     • Opportunities
     • Threats
     • Trends
     Categorize product and service offerings for on-site search and faceted navigation and to enrich content delivery.
     Annotate pages to enhance Web-search findability, ranking.
     Scrape competitor sites for offers and pricing.
     Analyze social and news media for competitive information.
  60. Voice of the Customer
     Text analytics is applied to enhance customer service and satisfaction.
     Analyze customer interactions and opinions –
     • E-mail, contact-center notes, survey responses
     • Forum & blog posting and other social media
     – to –
     • Address customer product & service issues
     • Improve quality
     • Manage brand & reputation
     If you can link qualitative information from text you can –
     • Link feedback to transactions
     • Assess customer value
     • Understand root causes
     • Mine data for measures such as churn likelihood
  61. E-Discovery and Compliance
     Text analytics is applied for compliance, fraud and risk, and e-discovery.
     Regulatory mandates and corporate practices dictate –
     • Monitoring corporate communications
     • Managing electronic stored information for production in event of litigation
     Sources include e-mail (!!), news, social media
     Risk avoidance and fraud detection are key to effective decision making
     • Text analytics mines critical data from unstructured sources
     • Integrated text-transactional analytics provides rich insights
  63. http://altaplana.com/TA2011
  64. [Bar chart] What textual information are you analyzing or do you plan to analyze?
     Series: 2011 (n=215), 2009 (n=100). Horizontal axis: 0% to 70%.
     Categories: text messages/SMS/chat; Web-site feedback; contact-center notes or transcripts; scientific or technical literature; e-mail and correspondence; review sites or forums; customer/market surveys; on-line forums; news articles; blogs and other social media.
     Percentages as extracted, in two runs of ten: 8%, 21%, 25%, 27%, 36%, 21%, 34%, 35%, 44%, 47%; 21%, 22%, 23%, 27%, 29%, 30%, 35%, 35%, 41%, 62%.
  67. Getting Started
     A best practices approach…
     Assess:
     • Assess business goals.
     • Understand information sources.
     • Consult and educate stakeholders.
     Evaluate:
     • Evaluate installed, hosted/SaaS, database-integrated options.
     • Determine performance and business requirements.
     • Match methods to goals, sources, and work practices.
     Implement:
     • Start with basic functions such as search, modest goals, or with a single information source.
     • Go for clear wins to gain support.
     • Build out applications, capacity, BI/research integration.
  68. Software & Platform Options
     Text-analytics options may be grouped in general classes.
     • Installed text-analysis application, whether desktop or server or deployed in-database.
     • Data mining workbench.
     • Hosted.
     • Programming tool.
     • As-a-service, via an application programming interface (API).
     • Code library or component of a business/vertical application, for instance for CRM, e-discovery, search.
     Text analytics is frequently embedded in search or other end-user applications.
     The slides that follow next will present leading options in each category except Hosted…
  69. Text Analysis Applications
     Vendors:
     - Attensity, Clarabridge, Daedalus, Digital Reasoning, Expert System, Fido Labs, IBM, Linguamatics, Medallia, NetBase, Open Text (Nstein), Provalis Research, SAP, SAS, SRA NetOwl, Sysomos, Temis.
     Typical uses:
     - Customer experience management (CEM), survey analysis, social-media analysis, law enforcement, life sciences.
     Typical characteristics:
     - Interface that allows the user to configure a processing pipeline.
     - Interface for text exploration and visualization.
     - Export to databases.
  70. Data Mining Workbench
     Vendors:
     - IBM SPSS Modeler, Megaputer PolyAnalyst, RapidMiner, SAS Text Analytics, WEKA.
     Typical uses:
     - Customer experience management (CEM), marketing analytics, survey analysis, social-media analysis, law enforcement.
     - Predictive modeling.
     Typical characteristics:
     - Same as text-analysis applications, but with more sophisticated modeling and analysis capabilities.
  71. Programming/Development Tool
     Vendors:
     - GATE, OpenNLP, Python NLTK, R, Stanford NLP – open source.
     - NooJ – free, non-open source.
     Typical uses:
     - Language modeling.
     - Data exploration.
     - Up to the programmer.
     Typical characteristics:
     - Text is an add-in to a programming language/environment.
  72. As a Service, via API
     Vendors:
     - Alchemy API, Clarabridge, Open Amplify, Pingar, Saplo, Semantria (Lexalytics), Thomson Reuters Calais, others.
     Typical uses:
     - Annotation and content enrichment with the application domain up to the user.
     Typical characteristics:
     - Relies on remote, server-resident processing resources.
     - May or, more likely, may not be end-user customizable.
  73. Code Library or Annotation Engine
     Vendors:
     - Alias-I LingPipe, GATE.
     - Basis Technology Rosette, Lexalytics Salience, SAP Inxight, SAS Teragram, TEMIS Luxid.
     Typical uses:
     - Information extraction in support of business applications.
     Typical characteristics:
     - Same as text-analysis applications, but with more sophisticated modeling and analysis capabilities.
  74. The Crowdsourcing Alternative (example)
  75. A Big Data Platform (example)
     http://www.geeklawblog.com/2011/12/lexis-advance-platform-launch-two.html
     http://hpccsystems.com/ (GNU Affero GPL)
  77. (Accessible) Data Everywhere
  78. Structure Matters
     http://open.blogs.nytimes.com/2012/02/16/rnews-is-here-and-this-is-what-it-means/
       <div itemscope itemtype="http://schema.org/Organization">
         <span itemprop="name">Google.org (GOOG)</span>
         Contact Details:
         <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
           Main address:
           <span itemprop="streetAddress">38 avenue de l'Opera</span>
           <span itemprop="postalCode">F-75002</span>
           <span itemprop="addressLocality">Paris, France</span> ,
         </div>
         Tel: <span itemprop="telephone">( 33 1) 42 68 53 00 </span>,
         Fax: <span itemprop="faxNumber">( 33 1) 42 68 53 01 </span>,
         E-mail: <span itemprop="email">secretariat(at)google.org</span>
       </div>
     http://schema.org/Organization
     http://img.freebase.com/api/trans/raw/m/02dtnzv
     http://www.cambridgesemantics.com/semantic-university/semantic-search-and-the-semantic-web
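Markup like the schema.org example above is what makes a page machine-readable: a program can read the `itemprop` values without any language analysis. The sketch below collects them with Python's built-in `html.parser`. The class name is hypothetical, the HTML is a shortened version of the slide's example, and nested item scopes are flattened for simplicity, which a proper microdata parser would not do.

```python
from html.parser import HTMLParser

# Shortened version of the slide's schema.org microdata example.
HTML_SNIPPET = """
<div itemscope itemtype="http://schema.org/Organization">
  <span itemprop="name">Google.org (GOOG)</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">38 avenue de l'Opera</span>
    <span itemprop="postalCode">F-75002</span>
  </div>
  <span itemprop="telephone">( 33 1) 42 68 53 00</span>
</div>
"""


class ItempropParser(HTMLParser):
    """Collect the text of leaf itemprop elements into a flat dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Skip container properties that open a nested item scope.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._current = attrs["itemprop"]
            self.properties[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data.strip()

    def handle_endtag(self, tag):
        self._current = None


parser = ItempropParser()
parser.feed(HTML_SNIPPET)
properties = parser.properties
print(properties)
```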
  79. An Introduction to Text Analytics: Social, Online, and Enterprise
     Seth Grimes, Alta Plana Corporation, @sethgrimes
     Text & Social Analytics Summit 2013 Workshop, June 4, 2013
