Perfect Text Analytics
Seth Redmore, VP, Product Management
Perfect
per·fect [adj., n. pur-fikt; v. per-fekt]
  1. conforming absolutely to the description or definition of an ideal type: a perfect sphere; a perfect gentleman.
  2. excellent or complete beyond practical or theoretical improvement: There is no perfect legal code. The proportions of this temple are almost perfect.
All rights reserved © 2010 Lexalytics Inc.
Text Analytics
  • The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)
  • In other words, enhancing the value of text content by extracting entities, features, context, relationships and emotion.
Perfect is Fast
  • Average human reading speed: 250 wpm
  • Conservative computer reading speed: 6000 wpm/core (our speed on a moderate single core)
  • Each core is equivalent to the reading bandwidth of 12 people.
  • Modern machines have 8 cores. That’s just about 100 people in a box. Nice.
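The arithmetic behind the "people in a box" figure can be sketched as follows. The inputs are the numbers quoted on the slide. The variable names are illustrative, and the "implied margin" is only an inference: dividing 6000 wpm by 250 wpm gives 24 readers per core, so the slide's figure of 12 appears to include an unstated safety factor of about 2.

```python
# Figures quoted on the slide.
HUMAN_WPM = 250                # average human reading speed
MACHINE_WPM_PER_CORE = 6000    # "conservative" machine reading speed per core
CORES_PER_MACHINE = 8
SLIDE_READERS_PER_CORE = 12    # the per-core equivalence stated on the slide


def readers_equivalent(machine_wpm: float, human_wpm: float) -> float:
    """Number of human readers whose combined speed matches one core."""
    return machine_wpm / human_wpm


# Straight division of the two quoted speeds.
raw_per_core = readers_equivalent(MACHINE_WPM_PER_CORE, HUMAN_WPM)

# Assumption: the gap between 24 and the slide's 12 is an extra safety margin.
implied_margin = raw_per_core / SLIDE_READERS_PER_CORE

# Whole-machine equivalents under each per-core figure.
box_slide = SLIDE_READERS_PER_CORE * CORES_PER_MACHINE
box_raw = raw_per_core * CORES_PER_MACHINE

print(f"raw readers per core:     {raw_per_core:.0f}")
print(f"implied safety margin:    {implied_margin:.1f}x")
print(f"readers per box (slide):  {box_slide}")
print(f"readers per box (raw):    {box_raw:.0f}")
```

With the slide's per-core figure, an 8-core machine comes to 96 readers, which matches "just about 100 people in a box".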
Perfect is Useable
  • “I don’t like the results” is not the same as “the results are incorrect”
  • Understanding the behavior is key to usefulness
  • Can you make better decisions?
  • Can you make more money or save money?
  • What is the most controversial area of text analytics?
  • Thomson Reuters trading w/Sentiment Analysis increased Alpha (profit over market) by 80 basis points
Useable: How much can you differ?
  • “In my shop, that up until now has relied exclusively on human coding, we consider anything below 90% to be unacceptably inaccurate…. There is no doubt that automated sentiment is getting much much better, but to suggest that people should be okay with 20% of their data being wrong is just absurd.” Katie Delahaye Payne
  • Why is 10% “wrong” so much less absurd than 20% “wrong”?
  • [Figure: 20% Error vs. 10% Error]
Perfect is Consistent
  • Same results for same content, every time
  • University of Pittsburgh “Multi-Perspective Question Answering” Corpus: 535 documents, 11k+ sentences.
  • 40 hours of training for each rater
  • ~80% inter-rater agreement
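For readers unfamiliar with the term, here is a minimal sketch of how a raw inter-rater agreement figure can be computed. The slide does not say which agreement measure was used for the ~80% figure, so simple percent agreement is an assumption here. The labels below are made up for illustration and are not taken from the MPQA corpus.

```python
def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters assigned the same label."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    if not rater_a:
        raise ValueError("need at least one labeled item")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


# Hypothetical sentence-level sentiment labels from two trained raters.
rater_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
rater_b = ["pos", "neg", "neu", "neg", "neg", "pos", "pos", "neg", "pos", "neu"]

agreement = percent_agreement(rater_a, rater_b)
print(f"inter-rater agreement: {agreement:.0%}")
```

The two raters above disagree on 2 of 10 sentences. An automated system, by contrast, returns the same label for the same sentence on every run.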
Perfect is (new) Knowledge
  • Discover the stuff you don’t know
  • Text Analytics is really, really great at telling you the who, the what, and the where. Sometimes the “how”
  • You have to supply the “why” – but that question is way easier to answer when you know the other “w’s and the h”
Perfect Includes Everything
  • Running our top of the line software flat out across one year will cost you about $0.002/document analyzed (news article sized content) (assuming 3 docs/core-second, 8 core machine)
  • The more data the better, and the greater the worth of your text analytics (TA)
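The throughput assumption on this slide (3 docs/core-second, 8 cores, running flat out for a year) can be turned into a document count as sketched below. The slide does not state the annual cost behind the per-document figure, so it is left as a parameter. The function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def docs_per_year(docs_per_core_second: float, cores: int) -> float:
    """Documents processed in one year of continuous operation."""
    return docs_per_core_second * cores * SECONDS_PER_YEAR


def cost_per_document(annual_cost: float,
                      docs_per_core_second: float,
                      cores: int) -> float:
    """Annual cost spread across every document processed that year."""
    return annual_cost / docs_per_year(docs_per_core_second, cores)


# The slide's stated assumptions: 3 docs/core-second on an 8 core machine.
capacity = docs_per_year(3, 8)
print(f"documents per year at full load: {capacity:,}")
```

Because the per-document cost is the annual cost divided by the documents processed, it is lowest when the machine is kept fully loaded.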
Perfect is Trainable
  • Can you solve YOUR business problem with it?
  • Can you optimize to suit different kinds of content and roll those results up into a single reporting system?
Perfect Text Analytics
  • Fast
  • Useable
  • Consistent
  • Knowledge
  • (that is) Inclusive
  • Trainable
Customer Snapshots (or, “rubber, meet road”)
Reputation Management
Politics
Market Intelligence
  • [Architecture diagram. Labels: Client Employee; User Authentication; Single Sign-on; External Content Providers; SinglePoint; Client Company; Web 2.0 Collaboration; Search Results; Secondary Research; Suppliers; MI Analyst; Text Analytics; Integrated Index; News & Journals; NL Search Engine; Firewall; Internal Document Repository; Optional Document Repository; Financial analyst reports; Internal research; Content Processing; Custom Web Crawls & Gov. Databases; Trashcan; crawl, FTP or CD]
Hospitality
Financial Services
  • Turns news into numbers for automatic trading systems
  • Company stocks + commodities
  • Resilient server product
  • Ultimate customers are financial institutions
  • QED (Quantitative and Event-Driven Trading): banks, hedge funds.
  • JPMorgan, SocGen, Alpha Equities… and others
  • [Diagram labels: Financial data; RNSE Server; Indicators; Algorithmic Trading (QED firm); Buy/Sell]
ROI – Retrieving Organized Information
  • RTI CONSULTING SERVICES
  • REPEATABLE EVOLVING DESIGNS
  • BALANCED METHODOLOGY: Business Assessment; User Interviews; Taxonomy Design and Recommendation; Content Governance / Analysis
  • DEPLOYMENT / SUPPORT: Solution Alternatives; Integration & Deployment; Testing, Tuning, and Evaluation
  • THOUGHT LEADERSHIP: Strategy Consultation; Roadmaps – Evolution and Growth
  • PROF. TED SULLIVAN
Pharma
The Next Year…
Opinion Mining
  • Who said what about whom?
Sarcasm, Twitter
  • Model trained to detect sarcasm
  • Once detected, you can decide what to do with it – because actually determining the sentiment is going to be unreliable
  • New model trained on Twitter content
  • Moving towards a concept of text analytics driven by business logic
Thesaurus-based Theme Rollup
  • Machine generated conceptual taxonomy
  • Gas/Electric Hybrid and EV might roll up to EV
  • Fewer themes, but very useful to detect patterns across content
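The rollup idea can be illustrated with a small sketch. This is not the product's algorithm: the hand-written `THESAURUS` mapping stands in for the machine-generated taxonomy, and the theme counts are hypothetical. Only the "Gas/Electric Hybrid rolls up to EV" example comes from the slide.

```python
from collections import Counter

# Stand-in for a machine-generated conceptual taxonomy (assumption).
# Keys are lower-cased themes, values are the parent concept.
THESAURUS = {
    "gas/electric hybrid": "ev",
    "ev": "ev",
}


def roll_up(theme_counts, thesaurus):
    """Merge theme counts under their parent concept.

    Themes with no thesaurus entry are kept under their own name.
    """
    rolled = Counter()
    for theme, count in theme_counts.items():
        key = theme.lower()
        rolled[thesaurus.get(key, key)] += count
    return rolled


# Hypothetical theme counts extracted from a set of documents.
theme_counts = {"Gas/Electric Hybrid": 4, "EV": 7, "battery life": 3}

rolled = roll_up(theme_counts, THESAURUS)
for theme, count in rolled.most_common():
    print(f"{theme}: {count}")
```

Three themes collapse to two, and the combined "ev" count makes the pattern across documents easier to spot.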
Foreign Language Support
  • French is first, followed by other Romance languages
  • New stemmer
  • New summarization algorithm
  • New part-of-speech tagger
  • Automatic language detection
  • New sentiment/entity extraction algorithms
  • Also applicable to vertical specific content
  • Confidence scoring by algorithm
  • Use business logic to meld the results
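One way to picture "confidence scoring by algorithm" combined with "business logic to meld the results" is a confidence-weighted average with a cutoff, sketched below. This is an illustration only, not the product's method. The `Result` type, the algorithm names, the scores, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    algorithm: str     # hypothetical algorithm name
    sentiment: float   # -1.0 (negative) to 1.0 (positive)
    confidence: float  # 0.0 to 1.0, reported by the algorithm itself


def meld(results: List[Result], min_confidence: float = 0.5) -> Optional[float]:
    """Confidence-weighted mean of the results that clear the cutoff.

    Returns None when no algorithm is confident enough, so the caller's
    business logic can decide what to do (for example, leave it unscored).
    """
    trusted = [r for r in results if r.confidence >= min_confidence]
    if not trusted:
        return None
    total_weight = sum(r.confidence for r in trusted)
    return sum(r.sentiment * r.confidence for r in trusted) / total_weight


# Hypothetical scores for one sentence from three algorithms.
results = [
    Result("phrase_based", 0.5, 0.75),
    Result("model_based", -0.5, 0.25),   # below the cutoff, ignored
    Result("entity_model", 1.0, 0.75),
]

melded = meld(results)
print(f"melded sentiment: {melded}")
```

The cutoff and the weighting are where the business logic lives. A stricter threshold trades coverage for reliability.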
Trainable Entity Sentiment
  • New technique for entity sentiment
  • Initial results from testing in English extremely promising
  • Average human scoring overlap of >> 90% for scored sentences
  • Initially used only for French
Tool Enhancements
  • Eventually use on English content: Twitter, Customer Satisfaction, Others…
  • Entity Management Toolkit (EMT): Part of Speech Tagset training; using to train Salience on French
  • Sentiment Toolkit: build your own entity sentiment models: French (first)
  • New Sentiment Toolkit + Maximum Entropy model builder allows new Entity and Sentiment modules
  • New EMT helps us build a new French PoS tagger
  • [Diagram labels: Doc; POS Tagger; Fully Tagged Document; Entity Extraction & Sentiment Models; Themes & Summaries]

Lexalytics Text Analytics Workshop: Perfect Text Analytics
