Text Analytics: Yesterday, Today and Tomorrow

In this talk we outline some of the key challenges in text analytics, describe some of Endeca's current research work in this area, examine the current state of the text analytics market and explore some of the prospects for the future.


Transcript

  • 1. Text Analytics: Yesterday, Today, Tomorrow
    Michael Ferretti, Tony Russell-Rose, Vladimir Zelevinsky
  • 2. Part 1 (of 3)
    • WHAT?
  • 3. What is Text Analytics?
    • A set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents
      • Text Analytics ~= Natural Language Processing ~= Text Mining
      • Text Mining -> Scientific / technical context, automated processing
      • Text Analytics -> Business context, interactive apps
    “NLP” vs. “Text Analytics”
    Copyright ©2010 Endeca Technologies, Inc. All rights reserved. Proprietary and confidential.
  • 4. Why is Text Analytics Important?
    • ‘ 80% of corporate information is unstructured’
      • Entire value chain for some organisations (media / publishing etc.)
      • Retail / eCommerce: Product reviews
      • User generated content: blogs, forums, wikis
      • Voice of the Customer: social media + sentiment analysis
    • 161 billion gigabytes of digital information in 2006
      • approximately 988 exabytes by 2010
      • Audio / video still needs summaries & tags etc.
  • 5. How Difficult Can It Be?
    • As humans we do it effortlessly ... don’t we?
    • DRUNK GETS NINE YEARS IN VIOLIN CASE
    • PROSTITUTES APPEAL TO POPE
    • STOLEN PAINTING FOUND BY TREE
    • RED TAPE HOLDS UP NEW BRIDGE
    • DEER KILL 300,000
    • RESIDENTS CAN DROP OFF TREES
    • INCLUDE CHILDREN WHEN BAKING COOKIES
    • MINERS REFUSE TO WORK AFTER DEATH  
  • 6. Some Fundamentals
    • Language is AMBIGUOUS
      • To find structure, we must remove ambiguity!
    • Lexical analysis (tokenisation)
      • The cat sat on the mat
      • I can’t tokenise this sentence
    • Morphology (term variations, prefixes, suffixes, etc.)
      • Computer, computing, compute, computed = comput*
      • Delegate = de-leg-ate (?)
      • Ratify = rat-ify (?)
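The tokenisation and morphology examples above can be sketched in a few lines of Python. This is a toy illustration, not the approach used in the talk: the regular expression, the `SUFFIXES` list and the function names are all assumptions made for the example.

```python
import re

def tokenise(text):
    # Keep word-internal apostrophes so "can't" stays a single token.
    return re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text)

# Hypothetical toy suffix list; a real stemmer uses far more careful rules.
SUFFIXES = ("ing", "ed", "er", "e")

def crude_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(tokenise("I can’t tokenise this sentence"))
print({crude_stem(w) for w in ("Computer", "computing", "compute", "computed")})
print(crude_stem("speed"))  # over-stemming: the "ed" here is not a suffix
```

The same blind suffix stripping that collapses the "comput*" family also mangles words such as "speed", which is the kind of false segmentation the "rat-ify" example pokes fun at.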
  • 7. More Fundamentals
    • Syntax (part of speech tagging)
      • Time flies like an arrow
      • Fruit flies like a banana
      • Eats shoots and leaves
    • Parsing (grammar)
      • I saw a venetian blind
      • I saw a blind venetian
      • Rugby is a game played by men with odd-shaped balls
    • Sentence boundary detection
      • Punctuation denotes the end of a sentence!
      • “But not always!”, said Fred...
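A minimal Python sketch of why sentence boundary detection is harder than splitting on punctuation. The `ABBREVIATIONS` set and both function names are hypothetical, chosen only for this illustration; production systems typically rely on trained models or much larger abbreviation lists.

```python
import re

def naive_sentences(text):
    # Split wherever ., ! or ? is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}

def sentences(text):
    # Re-join pieces that the naive splitter cut off after an abbreviation.
    result, current = [], []
    for piece in naive_sentences(text):
        current.append(piece)
        if piece.split()[-1] not in ABBREVIATIONS:
            result.append(" ".join(current))
            current = []
    if current:
        result.append(" ".join(current))
    return result

text = "Mrs. Clinton spoke. The crowd cheered!"
print(naive_sentences(text))  # splits wrongly after "Mrs."
print(sentences(text))
```

Even the patched version remains fragile: quoted speech, ellipses and unseen abbreviations all need further rules.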
  • 8. Named Entity Recognition/Information Extraction
    • Companies in New York != New companies in York
    • People, places, organisations ...
      • Increase precision
      • Support navigation
      • Facilitate translation, summarisation, speech synthesis, etc.
    • IE = template filling
      • Entities + relationships
      • Highly context dependent
    • Problems with:
      • Anaphora resolution
      • Word sense disambiguation
  • 9. Question Answering
    • Give me answers, not documents!
    • Fact-finding vs. exploratory search
      • Yes/no questions: ‘Is George W. Bush the current president of the USA?’
      • ‘Who’ questions: ‘Who was the British Prime Minister before Margaret Thatcher?’
      • List questions: ‘Which football teams have won the Champions League this decade?’
      • Instruction-based questions: ‘How do I cook lasagne?’
      • Explanation questions: ‘Why did World War I start?’
      • Commands: ‘Tell me the height of the Eiffel Tower.’
    • Question analysis -> document retrieval -> answer extraction
  • 10. Part 2 (of 3)
    • HOW?
  • 11.
    • Text Analytics is Computer Science + Semantics .
    • Semantics is the study of meaning .
    Definitions Universal flowchart:
  • 12.
    • No mind reading (yet). Have to use text .
    • Text approximates meaning .
    • Meaning is structured .
    Meaning and structure [diagram labels: CONCEPT, CONCEPT, CONCEPT]
  • 13.
    • Synonymy :
    • one concept maps to different words.
    • Polysemy :
    • one word maps to different concepts.
    Problems with text
  • 14. Simplest structure: salient terms
    “Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.” – García Márquez (1967)
  • 15. Typed entities
    • People, places, organizations, etc.
    • Simple approach: word lists.
    • More difficult: trained extractors (including sentiment).
  • 16. Highest clarity organizations for “baseball”
    Year    Top terms                 World Series Teams
    1987    Cardinals; Twins          Cardinals; Twins
    1988    Dodgers; Mets             Dodgers; Athletics
    1989    Athletics; Giants         Athletics; Giants
    1991    Braves; Twins             Braves; Twins
    1992    Blue Jays; Braves         Blue Jays; Braves
    1996    Yankees; Braves           Yankees; Braves
    1997    Indians; Marlins          Indians; Marlins
    1998    Yankees; Padres           Yankees; Padres
    1999    Braves; Mets; Yankees     Braves; Yankees
    2000    Yankees; Mets             Yankees; Mets
    2001    Diamondbacks; Yankees     Diamondbacks; Yankees
    2003    Marlins                   Marlins; Yankees
    Fail: 1990, 1993, 1995, 2002.
  • 17. Salient terms on a timeline: baseball
    Clarity scores for top terms for “baseball” search: No event in 1994!
  • 18. Salient terms on a timeline: Iraq
  • 19.
    • Excellent corpus:
    • Research articles.
    • Written by humans.
    • Tagged by authors.
    Case study: ACM
    But:
    • Half the articles untagged.
    • Tags sparse (90% of tags used once!)
    • Synonyms abound.
    tags -> controlled tag vocabulary -> high-scoring salient tags
  • 20.  
  • 21.
    • Co-occurrence :
    • salient terms that tend to occur together belong together.
    Clusters
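As a rough sketch of the co-occurrence idea, the Python below counts how often pairs of salient terms appear in the same document. The sample documents are invented for the example and the function name is hypothetical; the clustering method actually used in the talk is not described in the transcript.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    # Count each unordered pair of terms once per document it appears in.
    counts = Counter()
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts

# Invented salient-term sets, one per document.
docs = [
    {"apple", "iphone", "mac"},
    {"apple", "mac"},
    {"apple", "orchard", "pie"},
]

counts = cooccurrence(docs)
print(counts[("apple", "mac")])
print(counts[("mac", "pie")])
```

Pairs with high counts are candidates for the same cluster, while terms that never co-occur (here "mac" and "pie") hint at different meanings of the shared term.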
  • 22. Clusters for disambiguation
  • 23. Apple, meaning 1
  • 24. Apple, meaning 2
  • 25. Apple, meaning 3
  • 26.
    • Human brain is great at extracting information scent:
    • [word, word, word, …] -> meaning
    Information Scent
    • [ island, Indonesia ] -> Java
    • [ code, Sun ] -> Java
    • [ coffee, beans, brew ] -> Java
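The "Java" example can be turned into a toy disambiguator in Python: pick the sense whose cue words overlap most with the surrounding words. The cue words come from the slide; the sense labels and the function name are assumptions added for this sketch.

```python
# Cue words are from the slide; the sense labels are hypothetical.
SENSES = {
    "place": {"island", "indonesia"},
    "programming language": {"code", "sun"},
    "drink": {"coffee", "beans", "brew"},
}

def guess_sense(context_words):
    # Score each sense by how many of its cue words appear in the context.
    words = {w.lower() for w in context_words}
    scores = {sense: len(cues & words) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_sense(["Sun", "released", "new", "code"]))
print(guess_sense(["fresh", "beans", "for", "your", "morning", "brew"]))
```

With no matching cue words the function returns `None` rather than guessing, which mirrors a reader having no scent to follow.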
  • 27. Vector model – Salton (1983)
    • Similarity between documents = cosine of the angle between their vectors
    • Can also rotate basis for the best representation: LSI
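A minimal Python sketch of cosine similarity over raw term counts. It assumes whitespace tokenisation and unweighted counts for brevity (no TF-IDF and no LSI rotation); the function name is chosen for this example.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    # Represent each document as a bag-of-words count vector.
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    # An empty document has no direction, so treat its similarity as 0.
    return dot / norm if norm else 0.0

print(cosine("apple pie", "apple mac"))
print(cosine("cat", "dog"))
```

Documents sharing no terms score 0, identical documents score (up to floating-point rounding) 1, and everything else falls in between.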
  • 28. Semantic networks emotion ->
  • 29. Custom Dimensions
  • 30. Custom Dimensions
  • 31. Sentence structure parsing
  • 32. Sentence structure parsing
    Example: It is said Mrs. Clinton promises new jobs will be created by her.
    Tags: N V V N N V A N V V V N
    Processing steps:
    • part of speech tagging
    • noun / verb phrase extraction
    • sentence structure analysis
    • anaphora resolution
    • passive voice flipping
    • triple filtering
    • hierarchy generation
  • 33. Hierarchy generation (also semantic network!)
    • Nouns by head noun: [ Mrs. + Hillary + Bill + President ] -> Clinton
    • Verbs by hypernyms (broadening synonyms): [ say + tell + propose + suggest + declare ] -> express
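The two grouping rules can be sketched in Python as follows. Taking the last token as the head noun and using a hand-written hypernym table are simplifying assumptions for this example; a real system would use a parser and a lexical resource such as WordNet.

```python
from collections import defaultdict

# Hand-written hypernym table (assumption); the verbs are from the slide.
VERB_HYPERNYMS = {
    verb: "express" for verb in ("say", "tell", "propose", "suggest", "declare")
}

def head_noun(noun_phrase):
    # Simplification: treat the last token of the phrase as its head.
    return noun_phrase.split()[-1]

def group_by_head(noun_phrases):
    groups = defaultdict(list)
    for phrase in noun_phrases:
        groups[head_noun(phrase)].append(phrase)
    return dict(groups)

def broaden(verb):
    # Fall back to the verb itself when no hypernym is known.
    return VERB_HYPERNYMS.get(verb.lower(), verb.lower())

phrases = ["Mrs. Clinton", "Hillary Clinton", "Bill Clinton", "President Clinton"]
print(group_by_head(phrases))
print(broaden("Suggest"), broaden("run"))
```

Grouping under "Clinton" deliberately merges different people, so a navigation hierarchy built this way still needs a finer level beneath each head noun.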
  • 34. Idea Navigation
  • 35. Idea Navigation
  • 36. Idea Navigation
  • 37. Idea Navigation
  • 38. Part 3 (of 3)
    • WHO?
  • 39.
    • For profit :
      • Lexalytics
        • Text Enrichment module
        • Text Enrichment with Sentiment Analysis
      • Alias-i
        • Term Discovery
      • Nstein
        • Newssift
    • For fun :
      • GATE (Sheffield University)
        • Open source, linguistic focus
      • RapidMiner (University of Dortmund / Rapid-I)
        • Open source community edition, data mining focus
      • WordNet, OpenCalais, LingPipe, NLPwiki, etc.
    Text Analytics for fun and profit
  • 40.
    • Market maturing & expanding: 75-200%
      • Most vendors on target
    • Dominant markets:
      • CX, media/publishing, FS & insurance, intelligence, life sciences, e-discovery
    • Solutions still not standardized
      • Need for self-service tuning & configuration
    • Massive expansion in social media
      • Lightweight NLP for buzz analysis, brand monitoring, etc.
    • Partner ecosystem developing
      • Marketing services providers, platform vendors, CRM + call centre vendors, system integrators
    Market Outlook: 2010
  • 41. Conclusions
    “We are all interested in the future, for that is where you and I are going to spend the rest of our lives. And remember, my friend, future events such as these will affect you in the future.” – Edward Wood Jr. (1957)
    • What do we expect in the future?
    • Extraction leads to generation
    • Summarization
    • Generalization
    • Narratives
    • Inference and conflict resolution
  • 42. Text analytics: what does it mean?
    • Unstructured text isn't unstructured. There's always structure.
    • Find the information scent. Let the users follow it.
    • Don’t trust that one query is enough. Let the users interact.
    • Text does not matter. Meaning does.