Text Analytics: Yesterday, Today and Tomorrow

In this talk we outline some of the key challenges in text analytics, describe some of Endeca's current research work in this area, examine the current state of the text analytics market and explore some of the prospects for the future.

Published in: Technology, Education

1 Comment
  • I have put together a step by step guide on how to implement the text enrichment component in Endeca and perform entity extraction and sentiment analysis

    http://www.business-intelligence-quotient.com/?p=1801

Transcript

  • 1. Michael Ferretti, Tony Russell-Rose, Vladimir Zelevinsky. Text Analytics: Yesterday, Today, Tomorrow
  • 2. Part 1 (of 3) WHAT?
  • 3. What is Text Analytics?
      • A set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents
          – Text Analytics ~= Natural Language Processing ~= Text Mining
          – Text Mining → Scientific / technical context, automated processing
          – Text Analytics → Business context, interactive apps
      “NLP” vs. “Text Analytics”
      Copyright ©2010 Endeca Technologies, Inc. All rights reserved. Proprietary and confidential.
  • 4. Why is Text Analytics Important?
      • ‘80% of corporate information is unstructured’
          – Entire value chain for some organisations (media / publishing etc.)
          – Retail / eCommerce: Product reviews
          – User generated content: blogs, forums, wikis
          – Voice of the Customer: social media + sentiment analysis
      • 161 billion gigabytes of digital information in 2006
          – approximately 988 exabytes by 2010
          – Audio / video still needs summaries & tags etc.
  • 5. How Difficult Can It Be?
      • As humans we do it effortlessly ... don’t we?
      • DRUNK GETS NINE YEARS IN VIOLIN CASE
      • PROSTITUTES APPEAL TO POPE
      • STOLEN PAINTING FOUND BY TREE
      • RED TAPE HOLDS UP NEW BRIDGE
      • DEER KILL 300,000
      • RESIDENTS CAN DROP OFF TREES
      • INCLUDE CHILDREN WHEN BAKING COOKIES
      • MINERS REFUSE TO WORK AFTER DEATH
  • 6. Some Fundamentals
      • Language is AMBIGUOUS
          – To find structure, we must remove ambiguity!
      • Lexical analysis (tokenisation)
          – The cat sat on the mat
          – I can’t tokenise this sentence
      • Morphology (term variations, prefixes, suffixes, etc.)
          – Computer, computing, compute, computed = comput*
          – Delegate = de-leg-ate (?)
          – Ratify = rat-ify (?)
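The tokenisation and morphology points on slide 6 can be illustrated with a minimal sketch. It is not the approach used in the talk: the regular expression and the suffix list below are assumptions chosen only to reproduce the slide's "can’t" and "comput*" examples, and a real system would use a proper stemmer or lemmatiser.

```python
import re

# Illustrative tokeniser: a run of letters, optionally followed by an
# apostrophe and more letters, so "can't" stays a single token.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?")


def tokenise(text):
    """Split text into lower-cased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


# Hypothetical suffix list (checked in this order); not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "er", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    print(tokenise("I can’t tokenise this sentence"))
    print({crude_stem(w) for w in ["computer", "computing", "compute", "computed"]})
```

Blind suffix stripping is exactly what produces splits like "rat-ify": the rule has no notion of whether the remaining stem is a meaningful unit.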
  • 7. More Fundamentals
      • Syntax (part of speech tagging)
          – Time flies like an arrow
          – Fruit flies like a banana
          – Eats shoots and leaves
      • Parsing (grammar)
          – I saw a venetian blind
          – I saw a blind venetian
          – Rugby is a game played by men with odd-shaped balls
      • Sentence boundary detection
          – Punctuation denotes the end of a sentence!
          – “But not always!”, said Fred...
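The sentence boundary problem on slide 7 can be sketched as a rule-based splitter. This is an assumption-laden illustration, not the method from the talk: the abbreviation list is hypothetical and tiny, and the rule "terminal punctuation followed by whitespace" is only one of many possible heuristics.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc."}

# Candidate boundary: terminal punctuation, an optional closing quote,
# then whitespace. Punctuation followed by a comma is not a candidate.
BOUNDARY_RE = re.compile(r"""[.!?]+["'”]?\s+""")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Mrs." does not end a sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    slide_text = "Punctuation denotes the end of a sentence! “But not always!”, said Fred..."
    for sentence in split_sentences(slide_text):
        print(sentence)
```

On the slide's own example the exclamation mark inside the quotation is followed by a comma rather than whitespace, so the sketch keeps the quoted remark and "said Fred..." together.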
  • 8. Named Entity Recognition/Information Extraction
      • Companies in New York != New companies in York
      • People, places, organisations ...
          – Increase precision
          – Support navigation
          – Facilitate translation, summarisation, speech synthesis, etc.
      • IE = template filling
          – Entities + relationships
          – Highly context dependent
      • Problems with:
          – Anaphora resolution
          – Word sense disambiguation
  • 9. Question Answering
      • Give me answers, not documents!
      • Fact-finding vs. exploratory search
          – Yes/no questions: ‘Is George W. Bush the current president of the USA?’
          – ‘Who’ questions: ‘Who was the British Prime Minister before Margaret Thatcher?’
          – List questions: ‘Which football teams have won the Champions League this decade?’
          – Instruction-based questions: ‘How do I cook lasagne?’
          – Explanation questions: ‘Why did World War I start?’
          – Commands: ‘Tell me the height of the Eiffel Tower.’
      • Question analysis → document retrieval → answer extraction
  • 10. Part 2 (of 3) HOW?
  • 11. Definitions. Text Analytics is Computer Science + Semantics. Semantics is the study of meaning. Universal flowchart: [figure]
  • 12. Meaning and structure. No mind reading (yet). Have to use text. Text approximates meaning. Meaning is structured. [Figure: CONCEPT, CONCEPT, CONCEPT]
  • 13. Problems with text. Synonymy: one concept maps to different words. Polysemy: one word maps to different concepts.
  • 14. Simplest structure: salient terms. “Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.” – García Márquez (1967)
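One way to make "salient terms" concrete is to score each term by how much more likely it is in a document than in a background collection. The sketch below is an assumption, not the scoring used in the talk: it uses each term's contribution to the KL divergence between the document and a smoothed background model, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def salient_terms(doc_tokens, background_tokens, top_n=3):
    """Rank document terms by p(t|doc) * log(p(t|doc) / p(t|background))."""
    doc = Counter(doc_tokens)
    background = Counter(background_tokens)
    vocabulary = set(doc) | set(background)
    doc_total = sum(doc.values())
    # Add-one smoothing so unseen background terms get a non-zero probability.
    background_total = sum(background.values()) + len(vocabulary)
    scores = {}
    for term, count in doc.items():
        p_doc = count / doc_total
        p_background = (background[term] + 1) / background_total
        scores[term] = p_doc * math.log(p_doc / p_background)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    document = "the firing squad the ice the ice".split()
    background = "the cat sat on the mat and the dog sat on the rug".split()
    print(salient_terms(document, background, top_n=1))
```

With a background this small, frequent function words still score fairly well; a larger background corpus would push them down further.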
  • 15. Typed entities. People, places, organizations, etc. Simple approach: word lists. More difficult: trained extractors (including sentiment).
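The "simple approach: word lists" can be sketched as a gazetteer lookup. Everything in the word lists below is a hypothetical example, and the longest-match rule is an assumption added so that "New York" is not also reported as "York".

```python
import re

# Hypothetical word lists, one per entity type.
GAZETTEER = {
    "person": {"Aureliano Buendía", "Margaret Thatcher"},
    "place": {"New York", "York"},
    "organization": {"Endeca"},
}


def tag_entities(text):
    """Return (name, type) pairs found in text, in order of appearance."""
    # Try longer names first so "New York" wins over "York".
    entries = sorted(
        ((name, etype) for etype, names in GAZETTEER.items() for name in names),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    found, taken = [], set()
    for name, etype in entries:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            span = set(range(match.start(), match.end()))
            if span & taken:
                continue  # overlaps a longer name that was already matched
            taken |= span
            found.append((match.start(), name, etype))
    return [(name, etype) for _, name, etype in sorted(found)]


if __name__ == "__main__":
    print(tag_entities("Companies in New York"))
    print(tag_entities("New companies in York"))
```

Word lists cannot resolve names that are ambiguous in context, which is where the trained extractors mentioned on the slide come in.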
  • 16. Highest clarity organizations for “baseball”

      Year   Top terms                World Series teams
      1987   Cardinals; Twins         Cardinals; Twins
      1988   Dodgers; Mets            Dodgers; Athletics
      1989   Athletics; Giants        Athletics; Giants
      1991   Braves; Twins            Braves; Twins
      1992   Blue Jays; Braves        Blue Jays; Braves
      1996   Yankees; Braves          Yankees; Braves
      1997   Indians; Marlins         Indians; Marlins
      1998   Yankees; Padres          Yankees; Padres
      1999   Braves; Mets; Yankees    Braves; Yankees
      2000   Yankees; Mets            Yankees; Mets
      2001   Diamondbacks; Yankees    Diamondbacks; Yankees
      2003   Marlins                  Marlins; Yankees

      Fail: 1990, 1993, 1995, 2002.
  • 17. Salient terms on a timeline: baseball. No event in 1994! Clarity scores for top terms for “baseball” search: [chart]
  • 18. Salient terms on a timeline: Iraq
  • 19. Case study: ACM. Excellent corpus: research articles. Written by humans. Tagged by authors. But: half the articles untagged. Tags sparse (90% of tags used once!). Synonyms abound. tags → controlled tag vocabulary → high-scoring salient tags
  • 20.
  • 21. Clusters. Co-occurrence: salient terms that tend to occur together belong together.
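The co-occurrence idea can be sketched by counting how often two salient terms appear in the same document. This shows only the counting step, under the assumption that salient terms have already been extracted per document; the clustering method used in the talk is not described in the transcript, and the toy term lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents):
    """Count, for each unordered term pair, the documents containing both."""
    pairs = Counter()
    for terms in documents:
        # Sort so each pair has one canonical (alphabetical) order.
        for first, second in combinations(sorted(set(terms)), 2):
            pairs[(first, second)] += 1
    return pairs


if __name__ == "__main__":
    # Hypothetical salient terms per document.
    documents = [
        ["java", "island", "indonesia"],
        ["java", "code", "sun"],
        ["java", "coffee", "beans"],
        ["coffee", "beans", "brew"],
    ]
    counts = cooccurrence(documents)
    print(counts[("beans", "coffee")])
    print(counts[("island", "java")])
```

Pairs with high counts are candidates for the same cluster; a full implementation would normalise by term frequency before grouping.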
  • 22. Clusters for disambiguation
  • 23. Apple, meaning 1
  • 24. Apple, meaning 2
  • 25. Apple, meaning 3
  • 26. Information Scent. Human brain is great at extracting information scent: [word, word, word, …] → meaning. Java → [island, Indonesia] / [code, Sun] / [coffee, beans, brew]
  • 27. Vector model – Salton (1983). Similarity between documents = cosine of the angle between their vectors. Can also rotate basis for the best representation: LSI.
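The cosine measure on slide 27 can be written out for simple term-count vectors. This is a minimal sketch assuming raw counts as vector weights; weighting schemes such as TF-IDF and the LSI basis rotation mentioned on the slide are not shown.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine(["java", "coffee"], ["java", "code"]))
    print(cosine(["java"], ["island"]))
```

Documents that share no terms score 0 and identical documents score 1, regardless of length.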
  • 28. Semantic networks emotion →
  • 29. Custom Dimensions
  • 30. Custom Dimensions
  • 31. Sentence structure parsing
  • 32. Sentence structure parsing. It is said Mrs. Clinton promises new jobs will be created by her. Tags: N V V N N V A N V V V N. Steps: part of speech tagging; noun / verb phrase extraction; sentence structure analysis; anaphora resolution; passive voice flipping; triple filtering; hierarchy generation.
  • 33. Hierarchy generation (also semantic network!). Nouns by head noun: [Mrs. + Hillary + Bill + President] → Clinton. Verbs by hypernyms (broadening synonyms): [say + tell + propose + suggest + declare] → express.
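The two grouping rules on slide 33 can be sketched as follows. Both simplifications are assumptions: the head noun is taken to be the last word of the phrase, and the hypernym table is a hypothetical hand-built dictionary standing in for a lexical resource such as WordNet.

```python
from collections import defaultdict

# Hypothetical hypernym table; a real system might look these up in WordNet.
HYPERNYMS = {
    "say": "express",
    "tell": "express",
    "propose": "express",
    "suggest": "express",
    "declare": "express",
}


def group_by_head_noun(noun_phrases):
    """Group noun phrases under their head noun (assumed to be the last word)."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        head = phrase.split()[-1]
        groups[head].append(phrase)
    return dict(groups)


def broaden(verb):
    """Map a verb to its broader term, or leave it unchanged if unknown."""
    return HYPERNYMS.get(verb, verb)


if __name__ == "__main__":
    phrases = ["Mrs. Clinton", "Hillary Clinton", "Bill Clinton", "President Clinton"]
    print(group_by_head_noun(phrases))
    print(sorted({broaden(v) for v in ["say", "tell", "propose", "suggest", "declare"]}))
```

Grouping by head noun merges Bill and Hillary under one node, which is the intended broadening here but would need finer handling where the distinction matters.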
  • 34. Idea Navigation
  • 35. Idea Navigation
  • 36. Idea Navigation
  • 37. Idea Navigation
  • 38. Part 3 (of 3) WHO?
  • 39. Text Analytics for fun and profit
      • For profit:
          – Lexalytics
              • Text Enrichment module
              • Text Enrichment with Sentiment Analysis
          – Alias-i
              • Term Discovery
          – Nstein
              • Newssift
      • For fun:
          – GATE (Sheffield University)
              • Open source, linguistic focus
          – RapidMiner (University of Dortmund / Rapid-I)
              • Open source community edition, data mining focus
          – WordNet, OpenCalais, LingPipe, NLPwiki, etc.
  • 40. Market Outlook: 2010
      • Market maturing & expanding: 75-200%
          – Most vendors on target
      • Dominant markets:
          – CX, media/publishing, FS & insurance, intelligence, life sciences, e-discovery
      • Solutions still not standardized
          – Need for self-service tuning & configuration
      • Massive expansion in social media
          – Lightweight NLP for buzz analysis, brand monitoring, etc.
      • Partner ecosystem developing
          – Marketing services providers, platform vendors, CRM + call centre vendors, system integrators
  • 41. Conclusions: What do we expect in the future?
      • Extraction leads to generation
      • Summarization
      • Generalization
      • Narratives
      • Inference and conflict resolution
      “We are all interested in the future, for that is where you and I are going to spend the rest of our lives. And remember, my friend, future events such as these will affect you in the future.” – Edward Wood Jr. (1957)
  • 42. Text analytics: what does it mean? Unstructured text isn't unstructured. There's always structure. Find the information scent. Let the users follow it. Don’t trust that one query is enough. Let the users interact. Text does not matter. Meaning does.