Michael Ferretti
Tony Russell-Rose
Vladimir Zelevinsky
Text Analytics:
Yesterday
Today
Tomorrow
2
Part 1 (of 3)
WHAT?
3
What is Text Analytics?
• A set of linguistic, analytical and predictive techniques to extract
structure and meaning from unstructured documents
– Text Analytics ~= Natural Language Processing ~= Text Mining
– Text Mining → Scientific / technical context, automated processing
– Text Analytics → Business context, interactive apps
Copyright ©2010 Endeca Technologies, Inc. All rights reserved. Proprietary and confidential.
“NLP”
vs.
“Text Analytics”
4
Why is Text Analytics Important?
• ‘80% of corporate information is unstructured’
– Entire value chain for some organisations (media / publishing etc.)
– Retail / eCommerce: Product reviews
– User generated content: blogs, forums, wikis
– Voice of the Customer: social media + sentiment analysis
• 161 billion gigabytes (161 exabytes) of digital information in 2006
– approximately 988 exabytes by 2010
– Audio / video still needs summaries & tags etc.
5
How Difficult Can It Be?
• As humans we do it effortlessly ... don’t we?
• DRUNK GETS NINE YEARS IN VIOLIN CASE
• PROSTITUTES APPEAL TO POPE
• STOLEN PAINTING FOUND BY TREE
• RED TAPE HOLDS UP NEW BRIDGE
• DEER KILL 300,000
• RESIDENTS CAN DROP OFF TREES
• INCLUDE CHILDREN WHEN BAKING COOKIES
• MINERS REFUSE TO WORK AFTER DEATH
6
Some Fundamentals
• Language is AMBIGUOUS
– To find structure, we must remove ambiguity!
• Lexical analysis (tokenisation)
– The cat sat on the mat
– I can’t tokenise this sentence
• Morphology (term variations, prefixes, suffixes, etc.)
– Computer, computing, compute, computed = comput*
– Delegate = de-leg-ate (?)
– Ratify = rat-ify (?)
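The two steps above can be sketched in a few lines. This is a minimal illustration, not a production tokeniser or stemmer: the token pattern, the suffix list and the names `tokenise` and `crude_stem` are assumptions made for this example. It keeps "can’t" as one token and reduces the four "comput*" variants to one stem. Because it only strips a fixed list of suffixes, it never attempts splits such as "rat-ify".

```python
import re

# Assumed token pattern: a word with an optional internal apostrophe,
# or any single non-space, non-letter character (punctuation, digit).
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?|[^\sA-Za-z]")

# Assumed suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "e", "s")


def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    print(tokenise("I can’t tokenise this sentence"))
    print([crude_stem(w) for w in ["Computer", "computing", "compute", "computed"]])
```

A real system would more likely use an established stemmer or lemmatiser; the point here is only that naive suffix stripping works for regular families of words and says nothing about the harder cases.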
7
More Fundamentals
• Syntax (part of speech tagging)
– Time flies like an arrow
– Fruit flies like a banana
– Eats shoots and leaves
• Parsing (grammar)
– I saw a venetian blind
– I saw a blind venetian
– Rugby is a game played by men with odd-shaped balls
• Sentence boundary detection
– Punctuation denotes the end of a sentence!
– “But not always!”, said Fred...
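Sentence boundary detection can be sketched with one rule and one exception list. This is an assumption-laden illustration: the boundary pattern, the abbreviation list and the name `split_sentences` are invented for this example. A boundary is proposed where closing punctuation is followed by whitespace and a capital letter, and rejected when the preceding word is a known abbreviation.

```python
import re

# Assumed abbreviation list; a real one would be far longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e.", "etc."}

# Candidate boundary: ., ! or ? (optionally followed by a closing quote),
# then whitespace, then an optional opening quote and a capital letter.
BOUNDARY_RE = re.compile(r'''([.!?]+["”’']?)\s+(?=["“‘']?[A-Z])''')


def split_sentences(text):
    """Split text at candidate boundaries that do not follow an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY_RE.finditer(text):
        candidate = text[start:match.end(1)]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Mrs. Clinton" is not a sentence break
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ("Punctuation denotes the end of a sentence! "
              "“But not always!”, said Fred... Mrs. Clinton agreed.")
    for sentence in split_sentences(sample):
        print(sentence)
```

The exclamation mark inside the quotation is not treated as a boundary here only because a comma follows it; rules like this are brittle, which is why trained boundary detectors are common.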
8
Named Entity Recognition/Information Extraction
• Companies in New York != New companies in York
• People, places, organisations ...
– Increase precision
– Support navigation
– Facilitate translation, summarisation, speech synthesis, etc.
• IE = template filling
– Entities + relationships
– Highly context dependent
• Problems with:
– Anaphora resolution
– Word sense disambiguation
9
Question Answering
• Give me answers, not documents!
• Fact-finding vs. exploratory search
– Yes/no questions ‘Is George W. Bush the current president of the
USA?’
– ‘Who’ questions ‘Who was the British Prime Minister before Margaret
Thatcher?’
– List questions ‘Which football teams have won the Champions
League this decade?’
– Instruction-based questions ‘How do I cook lasagne?’
– Explanation questions ‘Why did World War I start?’
– Commands ‘Tell me the height of the Eiffel Tower.’
• Question analysis → document retrieval → answer extraction
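The first stage, question analysis, can be illustrated with a small rule-based classifier over the question types listed above. The rules, their order and the name `classify_question` are assumptions made for this sketch; they cover the example questions on this slide and little else.

```python
import re

# Assumed rules, tried in order; the first matching pattern wins.
RULES = [
    ("command", re.compile(r"^(tell|give|show|name)\b", re.I)),
    ("yes/no", re.compile(r"^(is|are|was|were|do|does|did|can|will|has|have)\b", re.I)),
    ("instruction", re.compile(r"^how (do|can|should|to)\b", re.I)),
    ("explanation", re.compile(r"^why\b", re.I)),
    ("who", re.compile(r"^(who|whom|whose)\b", re.I)),
    # Heuristic: "Which <one to three words> have/are/were ..." asks for a list.
    ("list", re.compile(r"^which (\w+ ){1,3}(have|are|were)\b", re.I)),
]


def classify_question(question):
    """Return the label of the first rule matching the question, else 'other'."""
    text = question.strip()
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "other"


if __name__ == "__main__":
    for q in ["Is George W. Bush the current president of the USA?",
              "Which football teams have won the Champions League this decade?",
              "How do I cook lasagne?"]:
        print(classify_question(q), "<-", q)
```

The label would then steer the later stages, for example by telling answer extraction whether to look for a person, a list, or a passage of explanation.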
10
Part 2 (of 3)
HOW?
11
Text Analytics is Computer Science + Semantics.
Semantics is the study of meaning.
Definitions
Universal flowchart:
12
No mind reading (yet). Have to use text.
Text approximates meaning.
Meaning is structured.
Meaning and structure
CONCEPT CONCEPT CONCEPT
13
Synonymy:
one concept maps to different words.
Polysemy:
one word maps to different concepts.
Problems with text
14
Simplest structure: salient terms
Many years later, as he faced the firing squad, Colonel
Aureliano Buendía was to remember that distant afternoon
when his father took him to discover ice.
– García Márquez (1967)
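One common way to score salient terms is TF-IDF: a term scores high when it is frequent in the document and rare in a background collection. The slides do not say which scoring was used, so the sketch below is a stand-in under that assumption; `words` and `salient_terms` are hypothetical names.

```python
import math
import re
from collections import Counter


def words(text):
    """Lower-case the text and return its runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def salient_terms(document, background, top_n=5):
    """Rank the document's terms by term frequency times inverse document frequency."""
    doc_counts = Counter(words(document))
    n_docs = len(background) + 1  # background documents plus this one
    doc_freq = Counter()
    for other in background:
        doc_freq.update(set(words(other)))
    scores = {
        term: count * math.log(n_docs / (1 + doc_freq[term]))
        for term, count in doc_counts.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


if __name__ == "__main__":
    background = ["He was to remember the day.", "Many years later he faced the crowd."]
    sentence = ("Many years later, as he faced the firing squad, Colonel Aureliano "
                "Buendía was to remember that distant afternoon when his father "
                "took him to discover ice.")
    print(salient_terms(sentence, background, top_n=8))
```

With a realistic background corpus, function words such as "the" and "to" score near zero while names and rare nouns rise to the top.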
15
Typed entities
People, places, organizations, etc.
Simple approach: word lists.
More difficult: trained extractors (including sentiment).
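The word-list approach can be sketched as a lookup against a typed list of names (a gazetteer). The entries, the type labels and the name `tag_entities` are illustrative assumptions; matching is exact and case-sensitive, and longer names are tried first so that "New York" is preferred over "York".

```python
import re

# Assumed word lists; a real gazetteer would hold thousands of entries.
GAZETTEER = {
    "PERSON": ["Margaret Thatcher", "George W. Bush"],
    "PLACE": ["New York", "York", "Eiffel Tower"],
    "ORGANIZATION": ["ACM"],
}


def tag_entities(text, gazetteer=GAZETTEER):
    """Return (surface form, type) pairs for every gazetteer entry found in text."""
    entries = [(name, label) for label, names in gazetteer.items() for name in names]
    # Longest names first, so the longest entry wins when two could start at the same place.
    entries.sort(key=lambda entry: len(entry[0]), reverse=True)
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(name) + r"(?!\w)" for name, _ in entries)
    )
    label_of = dict(entries)
    return [(m.group(0), label_of[m.group(0)]) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(tag_entities("Companies in New York"))
    print(tag_entities("New companies in York"))
```

Word lists cannot recognise names they have never seen and cannot tell an entity from a homonym, which is where trained extractors come in.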
16
Highest clarity organizations for “baseball”
Year   Top terms                 World Series teams
1987   Cardinals; Twins          Cardinals; Twins
1988   Dodgers; Mets             Dodgers; Athletics
1989   Athletics; Giants         Athletics; Giants
1991   Braves; Twins             Braves; Twins
1992   Blue Jays; Braves         Blue Jays; Braves
1996   Yankees; Braves           Yankees; Braves
1997   Indians; Marlins          Indians; Marlins
1998   Yankees; Padres           Yankees; Padres
1999   Braves; Mets; Yankees     Braves; Yankees
2000   Yankees; Mets             Yankees; Mets
2001   Diamondbacks; Yankees     Diamondbacks; Yankees
2003   Marlins                   Marlins; Yankees
Fail: 1990, 1993, 1995, 2002.
17
Salient terms on a timeline: baseball
No event in 1994!
Clarity scores for top terms for “baseball” search:
18
Salient terms on a timeline: Iraq
19
Excellent corpus:
Research articles.
Written by humans.
Tagged by authors.
Case study: ACM
But:
Half the articles untagged.
Tags sparse (90% of tags used once!)
Synonyms abound.
tags → controlled tag vocabulary → high-scoring salient tags
20
21
Co-occurrence:
salient terms that tend to occur together belong together.
Clusters
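The co-occurrence idea can be sketched by counting how often two salient terms appear in the same document and keeping the pairs seen often enough. The threshold, the toy documents and the names `cooccurrence_counts` and `neighbours` are assumptions for illustration; this counts pairs only and is not a full clustering algorithm.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered term pair, the documents containing both terms."""
    pair_counts = Counter()
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def neighbours(term, pair_counts, min_count=2):
    """Return (other term, count) for pairs with `term` seen at least min_count times."""
    result = []
    for (a, b), count in pair_counts.items():
        if count >= min_count and term in (a, b):
            result.append((b if a == term else a, count))
    return sorted(result, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Each document is represented by its salient terms (toy data).
    docs = [
        ["apple", "fruit", "pie"],
        ["apple", "fruit", "orchard"],
        ["apple", "mac", "iphone"],
        ["apple", "mac", "computer"],
    ]
    counts = cooccurrence_counts(docs)
    print(neighbours("apple", counts))
```

In the toy data, "fruit" and "mac" both co-occur with "apple" but never with each other, which is the signal a clustering step would use to separate the senses.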
22
Clusters for disambiguation
23
Apple, meaning 1
24
Apple, meaning 2
25
Apple, meaning 3
26
Human brain is great at extracting information scent:
[word, word, word, …] → meaning
Information Scent
[island, Indonesia] [code, Sun] [coffee, beans, brew] → Java
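A machine can imitate this by comparing the words around an ambiguous term with a set of cue words per sense and picking the sense with the largest overlap. The sketch below uses only the cue words from the slide; the sense labels and the name `disambiguate` are assumptions, and real systems would learn the cues (for example from co-occurrence clusters) instead of listing them by hand.

```python
# Cue words per sense, taken from the slide; the sense labels are assumed.
SENSES = {
    "Java (island)": {"island", "indonesia"},
    "Java (programming language)": {"code", "sun"},
    "Java (coffee)": {"coffee", "beans", "brew"},
}


def disambiguate(context_words, senses=SENSES):
    """Return the sense whose cue words overlap most with the context, or None."""
    context = {word.lower() for word in context_words}
    overlaps = {sense: len(context & cues) for sense, cues in senses.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate(["island", "Indonesia"]))
    print(disambiguate(["code", "Sun"]))
    print(disambiguate(["coffee", "beans", "brew"]))
```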
27
Vector model
– Salton (1983)
Similarity between documents = cosine of the angle between their vectors
Can also rotate basis for the best representation: LSI
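As a concrete illustration of the cosine measure, each document can be turned into a vector of term counts, and the similarity is the dot product divided by the product of the vector lengths: cos θ = (a · b) / (|a| |b|). The sketch assumes raw counts and whitespace tokenisation; `term_vector` and `cosine_similarity` are names chosen for this example, and the LSI rotation is not shown.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a sparse vector of lower-cased term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    d1 = term_vector("time flies like an arrow")
    d2 = term_vector("fruit flies like a banana")
    print(round(cosine_similarity(d1, d2), 3))
```

Documents sharing no terms score 0 and identical documents score 1, regardless of length, which is why the angle is preferred over raw overlap counts.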
28
Semantic networks
emotion →
29
Custom Dimensions
30
Custom Dimensions
31
Sentence structure parsing
32
It is said Mrs. Clinton promises new jobs will be created by her.
N V V N N V A N V V V N
part of speech tagging
noun / verb phrase extraction
sentence structure analysis
anaphora resolution
passive voice flipping
triple filtering
hierarchy generation
Sentence structure parsing
33
Nouns by head noun:
[Mrs. + Hillary + Bill + President]
→ Clinton
Verbs by hypernyms (broadening synonyms):
[say + tell + propose + suggest + declare]
→ express
Hierarchy generation (also semantic network!)
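Both groupings can be sketched with small lookups. The sketch assumes the head noun of an English noun phrase is its last word, and it uses a hand-written hypernym table built from the verbs on this slide; a real system might consult a lexical resource such as WordNet. The names `group_by_head_noun` and `broaden_verb` are hypothetical.

```python
from collections import defaultdict

# Hand-written hypernym table from the slide's example (an assumption,
# standing in for a lookup in a lexical resource).
HYPERNYMS = {
    "say": "express",
    "tell": "express",
    "propose": "express",
    "suggest": "express",
    "declare": "express",
}


def head_noun(noun_phrase):
    """Assume the head noun is the last word of the phrase."""
    return noun_phrase.split()[-1]


def group_by_head_noun(noun_phrases):
    """Group noun phrases under their shared head noun."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        groups[head_noun(phrase)].append(phrase)
    return dict(groups)


def broaden_verb(verb, hypernyms=HYPERNYMS):
    """Map a verb to its broader term, or return it unchanged if unknown."""
    return hypernyms.get(verb.lower(), verb.lower())


if __name__ == "__main__":
    phrases = ["Mrs. Clinton", "Hillary Clinton", "Bill Clinton", "President Clinton"]
    print(group_by_head_noun(phrases))
    print([broaden_verb(v) for v in ["say", "tell", "propose", "suggest", "declare"]])
```

Note that grouping by head noun alone merges "Hillary Clinton" and "Bill Clinton" under one node; the hierarchy is useful precisely because the children stay distinguishable beneath it.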
34
Idea Navigation
35
Idea Navigation
36
Idea Navigation
37
Idea Navigation
38
Part 3 (of 3)
WHO?
39
• For profit:
– Lexalytics
▪ Text Enrichment module
▪ Text Enrichment with Sentiment Analysis
– Alias-i
▪ Term Discovery
– Nstein
▪ Newssift
• For fun:
– GATE (Sheffield University)
▪ Open source, linguistic focus
– RapidMiner (University of Dortmund / Rapid-I)
▪ Open source community edition, data mining focus
– WordNet, OpenCalais, LingPipe, NLPwiki, etc.
Text Analytics for fun and profit
40
• Market maturing & expanding: 75-200%
– Most vendors on target
• Dominant markets:
– CX, media/publishing, FS & insurance, intelligence, life sciences,
e-discovery
• Solutions still not standardized
– Need for self-service tuning & configuration
• Massive expansion in social media
– Lightweight NLP for buzz analysis, brand monitoring, etc.
• Partner ecosystem developing
– Marketing services providers, platform vendors, CRM + call centre
vendors, system integrators
Market Outlook: 2010
41
Conclusions
What do we expect in the future?
• Extraction leads to generation
• Summarization
• Generalization
• Narratives
• Inference and conflict resolution
We are all interested in the future, for that is where you and I
are going to spend the rest of our lives. And remember, my
friend, future events such as these will affect you in the future.
– Edward Wood Jr. (1957)
42
Text analytics: what does it mean?
Unstructured text isn't unstructured. There's always structure.
Find the information scent. Let the users follow it.
Don’t trust that one query is enough. Let the users interact.
Text does not matter. Meaning does.
