Text Mining 1: Introduction
9th Madrid Summer School (2014) on Advanced Statistics and Data Mining
Florian Leitner
florian.leitner@upm.es
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
“Text Mining” or
“Text Analytics”
The discovery of {new or existing} facts by applying natural
language processing (“NLP”) & statistical learning techniques.
[Diagram: Machine Learning, Inferential Statistics, Computational Linguistics; Rules, Models, Predictions]
Language Understanding =
Artificial Intelligence ?
“Her” (movie, 2013)
“Watson” & “CRUSH”: IBM’s future bet on mainframes & AI
  cognitive computing: “processing information more like a human than a machine”
  CRUSH = Criminal Reduction Utilizing Statistical History (IBM, reality): “predict crimes before they happen”
  compare: Precogs (Minority Report, movie)
“The Singularity”: Ray Kurzweil (Google’s director of engineering); if? when?
Examples of Language
Processing Applications
Example                              Task
Spam filtering                       Document Classification
Date/time event detection            Information Extraction
(Web) Search engines                 Information Retrieval
Watson in Jeopardy! (IBM)            Question Answering
Twitter brand monitoring             Sentiment Analysis (Stat. NLP)
Siri (Apple) and Google Now          Language Understanding
Spelling Correction                  Statistical Language Modeling
Website translation (Google)         Machine Translation
“Clippy” Assistant (Microsoft)       Dialog System
Finding similar items (Amazon)       Recommender System
[Side labels: Text Mining, Language Processing]
Current Topics
in Text Mining
You will learn about…
• Language Modeling
• String Processing
• Text Classification
• Information Extraction
Other topics…
• Information Retrieval
• Question Answering
• Dialogue Systems
• Text Summarization
• Machine Translation
• Language Understanding
Course requirements…
Basic Linear Algebra and Probability Theory; Computer Savvy
Words, Tokens, Shingles,
and N-Grams
Text with words: This is a sentence.
Tokens: [This] [is] [a] [sentence] [.]
2-Shingles: [This is] [is a] [a sentence] [sentence .]
3-Shingles: [This is a] [is a sentence] [a sentence .]
Token N-Grams, a.k.a. k-Shingling
NB: “tokenization” can be Character-based, use Regular Expressions, be Probabilistic, …
Character N-Grams: all trigrams of “sentence”: [sen, ent, nte, ten, enc, nce]
Beware: the terms “k-shingle” and “n-gram” are not used consistently…
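A minimal sketch of token and character n-grams for the example sentence. The regular-expression tokenizer and the helper names `tokenize` and `ngrams` are illustrative assumptions, not something taken from the slides or from NLTK.

```python
import re


def tokenize(text):
    """Assumed toy tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(sequence, n):
    """Return all contiguous n-grams (as tuples) of a list of tokens or a string."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


tokens = tokenize("This is a sentence.")
bigrams = ngrams(tokens, 2)    # token 2-shingles
trigrams = ngrams(tokens, 3)   # token 3-shingles

# Character n-grams: the same sliding window, applied to the characters of one token.
char_trigrams = ["".join(gram) for gram in ngrams("sentence", 3)]

print(tokens)
print(bigrams)
print(trigrams)
print(char_trigrams)
```

The same sliding window serves both cases; only the unit (token vs. character) changes.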
Lemmatization, Part-of-Speech (PoS)
Tagging, and Named Entity
Recognition (NER)
Token          Lemma          PoS   NER
Constitutive   constitutive   JJ    O
binding        binding        NN    O
to             to             TO    O
the            the            DT    O
peri-κ         peri-kappa     NN    B-DNA
B              B              NN    I-DNA
site           site           NN    I-DNA
is             be             VBZ   O
seen           see            VBN   O
in             in             IN    O
monocytes      monocyte       NNS   B-cell
.              .              .     O
PoS Tagset: Penn Treebank
NER Tagging: B-I-O
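A small sketch of how B-I-O tags can be decoded into entity mentions, using the tokens and NER tags from the table above. The function name `bio_to_entities` is a hypothetical helper, and dropping an `I-` tag that does not continue an open entity is one possible convention among several.

```python
def bio_to_entities(tokens, tags):
    """Collect (entity_type, text) pairs from parallel token and B-I-O tag lists."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with nothing to continue) closes any open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = ["Constitutive", "binding", "to", "the", "peri-κ", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA",
        "O", "O", "O", "B-cell", "O"]

entities = bio_to_entities(tokens, tags)
print(entities)
```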
Information Retrieval (IR)
Comparing Word Vectors: Cosine Similarity
[Figure: Text1 and Text2 drawn as vectors in a space with axes count(Word1), count(Word2), and count(Word3); angles α, β, γ]
Similarity(T1, T2) := cos(T1, T2)
Text Vectorization: Inverted Index
Text 1: He that not wills to the end neither
wills to the means.
Text 2: If the mountain will not go to Moses,
then Moses must go to the mountain.
tokens Text 1 Text 2
end 1 0
go 0 2
he 1 0
if 0 1
means 1 0
Moses 0 2
mountain 0 2
must 0 1
not 1 1
that 1 0
the 2 2
then 0 1
to 2 2
will 2 1
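A minimal sketch of the cosine similarity between the two texts, computed directly from the token counts in the table above. The dictionary layout and the function name `cosine_similarity` are illustrative choices; no library is assumed beyond the standard `math` module.

```python
import math

# Token counts copied from the table above: token -> (count in Text 1, count in Text 2).
COUNTS = {
    "end": (1, 0), "go": (0, 2), "he": (1, 0), "if": (0, 1), "means": (1, 0),
    "Moses": (0, 2), "mountain": (0, 2), "must": (0, 1), "not": (1, 1),
    "that": (1, 0), "the": (2, 2), "then": (0, 1), "to": (2, 2), "will": (2, 1),
}


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|) for two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


text1 = [counts[0] for counts in COUNTS.values()]
text2 = [counts[1] for counts in COUNTS.values()]

similarity = cosine_similarity(text1, text2)
print(round(similarity, 2))  # roughly 0.53 for these counts
```

Only tokens present in both texts (not, the, to, will) contribute to the dot product; every other token only lengthens one of the two vectors.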
Document Classification
[Figure: documents plotted on the 1st and 2nd principal components, with document distance, cluster, and centroid marked]
Supervised (“Learning to Classify”, e.g., spam filtering)
vs.
Unsupervised (“Exploratory Grouping”, e.g., topic modeling)
Inverted (I-) Indices
Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.
unigrams    T1   T2   p(T1)   p(T2)
end          1    0    0.09    0.00
go           0    2    0.00    0.13
he           1    0    0.09    0.00
if           0    1    0.00    0.07
means        1    0    0.09    0.00
Moses        0    2    0.00    0.13
mountain     0    2    0.00    0.13
must         0    1    0.00    0.07
not          1    1    0.09    0.07
that         1    0    0.09    0.00
the          2    2    0.18    0.13
then         0    1    0.00    0.07
to           2    2    0.18    0.13
will         2    1    0.18    0.07
SUM         11   15    1.00    1.00
bigrams Text 1 Text 2
end, neither 1 0
go, to 0 2
he, that 1 0
if, the 0 1
Moses, must 0 1
Moses, then 0 1
mountain, will 0 1
must, go 0 1
not, go 0 1
not, will 1 0
that, not 1 0
the, means 1 0
the, mountain 0 2
then, Moses 0 1
to, Moses 0 1
to, the 2 1
will, not 0 1
will, to 2 0
factors, normalization (len[text]), probabilities, and n-grams
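A short sketch of the two steps behind these tables: normalizing raw counts by text length to get relative frequencies, and counting token bigrams. The Text 1 counts are copied from the unigram table; the Text 2 token list is a hand-normalized version of the sentence (punctuation removed), which is an assumption about the preprocessing. The function names are illustrative.

```python
from collections import Counter

# Unigram counts for Text 1, copied from the table above (zero counts omitted).
T1_COUNTS = {"end": 1, "he": 1, "means": 1, "not": 1,
             "that": 1, "the": 2, "to": 2, "will": 2}

# Text 2 with punctuation stripped by hand (assumed preprocessing).
T2_TOKENS = ("if the mountain will not go to Moses "
             "then Moses must go to the mountain").split()


def relative_frequencies(counts):
    """Normalize counts by the total number of counted tokens, len[text]."""
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


p_t1 = relative_frequencies(T1_COUNTS)
t2_bigrams = bigram_counts(T2_TOKENS)

print(round(p_t1["the"], 2))          # 2 of 11 counted tokens
print(t2_bigrams[("go", "to")])       # "go to" occurs twice in Text 2
```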
I-Indices and the Central Dogma of Machine Learning
y = h_θ(X)
[Figure: y = X × θ]
y: Rank, Class, Expectation, Probability, Descriptor*, …
X: I-Index (transposed); rows are “texts” (n), i.e., instances or observations; columns are n-grams (p), i.e., variables or features
θ: [Parameters]
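A toy sketch of y = h_θ(X) with a linear hypothesis, where X is a transposed inverted index (one row per text, one column per n-gram). The four columns reuse counts for "go", "not", "the", and "will" from the earlier tables; the parameter values in `theta` are made up purely for illustration, and a linear h is only one possible choice of hypothesis.

```python
NGRAMS = ["go", "not", "the", "will"]   # p = 4 features (columns)

# Transposed inverted index: n = 2 texts (rows) by p n-gram counts (columns).
X = [
    [0, 1, 2, 2],   # Text 1
    [2, 1, 2, 1],   # Text 2
]

# Hypothetical parameters, one per feature; in practice these are learned.
theta = [0.5, -1.0, 0.0, 0.25]


def h(theta, X):
    """Linear hypothesis: one output per row of X, computed as the dot product x . theta."""
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]


y = h(theta, X)
print(y)
```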
The Curse of Dimensionality

(RE Bellman, 1961) [inventor of dynamic programming]
• p ≫ n (more tokens/n-grams/features than texts/documents)
• Inverted indices (X) are very sparse matrices.
• Even with millions of training examples, unseen tokens will
keep coming up in the “test set” or in production.
• In a high-dimensional hypercube, most instances are closer to
the face of the cube (“nothing”) than their nearest neighbor.
✓ Remedy: the “blessing of non-uniformity” ➡ dimensionality
reduction (a.k.a. [low-dimensional] embedding)
‣ feature extraction: PCA, LDA, factor analysis, unsupervised classification of
tokens based on their surrounding tokens (“word embedding”), …
‣ feature “reduction”: locality-sensitive hashing, random projections, …
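A small simulation sketch of the hypercube claim: for points drawn uniformly from the unit hypercube, it counts how many lie closer to the nearest face of the cube than to their nearest neighbor. The sample size, the seed, and the function name `fraction_closer_to_face` are arbitrary illustrative choices.

```python
import math
import random


def fraction_closer_to_face(dim, n_points=100, seed=0):
    """Fraction of uniform random points in [0, 1]^dim that are closer to a
    face of the cube than to their nearest neighbor in the sample."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    closer = 0
    for i, p in enumerate(points):
        # Distance to the nearest face: smallest gap to 0 or 1 on any axis.
        to_face = min(min(x, 1.0 - x) for x in p)
        # Euclidean distance to the nearest other point in the sample.
        to_neighbor = min(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            for j, q in enumerate(points) if j != i
        )
        if to_face < to_neighbor:
            closer += 1
    return closer / n_points


low_dim = fraction_closer_to_face(2)
high_dim = fraction_closer_to_face(50)
print(low_dim, high_dim)
```

With the same number of points, the fraction should grow sharply as the dimension increases, which is the effect the bullet describes.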
Sentiment Analysis
http://www.sentiment140.com
feelings are complex and
not black or white…
(irony, negations)
Information Extraction (IE)
“from unstructured (?) text to structured data”
“from non-normalized text to connected data”
input (text) is structured
output is structured, too!
NB: Information Retrieval (IR) ≠ IE
Language Understanding
disambiguation!
• Coreference (Anaphora) Resolution
• Named Entity Recognition
• Parsing
[Image: Apple Siri]
Text Summarization
…is hard because…
Variance/human agreement: When is a summary “correct”?
Coherence: providing discourse structure (text flow) to the summary.
Paraphrasing: important sentences are repeated, but with different wordings.
Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)
Anaphora (coreference) resolution: very hard, but crucial.
Question Answering
Biggest issue: very domain specific
[Images: IBM, WolframAlpha]
Category: Oscar Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
Answer: The Silence of the Lambs
Machine Translation
Languages…
‣ have no gender (en: the) or use different genders (es/de: el/die; la/der; ??/das)
‣ have different verb placements (es⬌de).
‣ have a different concept of verbs (latin, arab, cjk).
‣ use different tenses (en⬌de).
‣ have different word orders (latin, arab, cjk).
Ambiguity
Anaphora Resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
Part-of-Speech Tagging: The robot wheels out the iron.
Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
Named Entity Recognition: Is Princeton really good for you?
It’s all in the semantics! (Or is it?)
The Conditional Probability
for Dependent Events
Joint Probability
P(X ∩ Y) = P(X, Y) = P(X × Y)
The multiplication principle for dependent events*:
P(X ∩ Y) = P(Y) × P(X | Y)
and therefore, by using a little algebra:
Conditional Probability
P(X | Y) = P(X ∩ Y) ÷ P(Y)
*Independence
P(X ∩ Y) = P(X) × P(Y)
P(X | Y) = P(X)
P(Y | X) = P(Y)
[Venn diagram: X, Y]
Marginal, Conditional and
Joint Probabilities
Joint Probability*
P(xi, yj) = P(xi) × P(yj)
*for independent events
Marginal Probability
P(yj)
Conditional Probability
P(xi | yj) = P(xi, yj) ÷ P(yj)
contingency table (variable/factor X in columns, variable/factor Y in rows, M = margin)
        X=x1               X=x2               M
Y=y1    a/n = P(x1, y1)    b/n = P(x2, y1)    (a+b)/n = P(y1)
Y=y2    c/n = P(x1, y2)    d/n = P(x2, y2)    (c+d)/n = P(y2)
M       (a+c)/n = P(x1)    (b+d)/n = P(x2)    ∑ / n = 1 = P(X) = P(Y)
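A worked sketch of joint, marginal, and conditional probabilities from a 2×2 contingency table. The cell counts a, b, c, d are invented for illustration; exact fractions are used so the identities can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical cell counts of a 2x2 contingency table (not from the slides).
a, b, c, d = 30, 10, 20, 40
n = a + b + c + d

# Joint probabilities P(x, y) = cell count / n.
joint = {
    ("x1", "y1"): Fraction(a, n), ("x2", "y1"): Fraction(b, n),
    ("x1", "y2"): Fraction(c, n), ("x2", "y2"): Fraction(d, n),
}


def marginal_x(x):
    """P(x): sum the joint probabilities over all values of Y (column margin)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """P(y): sum the joint probabilities over all values of X (row margin)."""
    return sum(p for (_, yj), p in joint.items() if yj == y)


def conditional_x_given_y(x, y):
    """P(x | y) = P(x, y) / P(y)."""
    return joint[(x, y)] / marginal_y(y)


print(marginal_y("y1"))                   # (a+b)/n
print(conditional_x_given_y("x1", "y1"))  # a/(a+b)

# Multiplication principle: P(x, y) = P(y) * P(x | y).
assert joint[("x1", "y1")] == marginal_y("y1") * conditional_x_given_y("x1", "y1")
# These counts describe dependent events: P(x1, y1) differs from P(x1) * P(y1).
print(joint[("x1", "y1")] == marginal_x("x1") * marginal_y("y1"))
```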
Bayes’ Rule:
Diachronic Interpretation
H - Hypothesis
D - Data
P(H|D) = P(H) × P(D|H) ÷ P(D)
posterior: P(H|D)
prior: P(H)
likelihood: P(D|H)
“normalizing constant” (law of total probability): P(D)
Bayes’ Rule:
The Monty Hall Problem
P(H|D) = P(H) × P(D|H) ÷ P(D)
your pick: door 1
given the car is behind H=1, Monty Hall opens D=(2 or 3)
H=2 → D=3
H=3 → D=2
Door (H)              1               2               3
Prior p(H)            1/3             1/3             1/3
Likelihood p(D|H)     1/2             1               0
p(H) × p(D|H)         1/3×1/2 = 1/6   1/3×1 = 1/3     1/3×0 = 0
Posterior p(H|D)      1/6÷1/2 = 1/3   1/3÷1/2 = 2/3   0÷1/2 = 0
p(D) = ∑ p(H) × p(D|H) = 1/6 + 1/3 = 1/2
practical use: a trickster hides a stone with three cups…
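The Bayes update for the Monty Hall table can be sketched in a few lines. Priors and likelihoods are the values from the slide (you pick door 1, Monty opens door 3); the function name `bayes_update` is an illustrative choice.

```python
from fractions import Fraction

# Hypotheses: the car is behind door 1, 2, or 3.
prior = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]

# Likelihood p(D|H) of the observed data under each hypothesis.
likelihood = [Fraction(1, 2), Fraction(1), Fraction(0)]


def bayes_update(prior, likelihood):
    """Return (posterior, evidence), where the evidence p(D) is the sum of
    p(H) * p(D|H) over all hypotheses (law of total probability)."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)
    posterior = [u / evidence for u in unnormalized]
    return posterior, evidence


posterior, p_d = bayes_update(prior, likelihood)
print(p_d)        # 1/2
print(posterior)  # posterior for doors 1, 2, 3: 1/3, 2/3, 0
```

Switching to door 2 doubles the chance of winning, because the posterior mass of the opened door moves entirely to the other unopened door.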
An Overview of Open Source
NLP Frameworks
• Natural Language ToolKit
‣ NLTK, Python
• General Architecture for Text
Engineering
‣ GATE, Java
• Stanford NLP Framework
‣ CoreNLP, Java
• Unstructured Information
Management Architecture
‣ UIMA, Java
‣ Many framework-sized sub-projects,
e.g., ClearNLP
• LingPipe Framework
‣ LingPipe, Java (open source, but only
free for “non-commercial” use)
• FreeLing NLP Suite
‣ FreeLing, C++
• The Lemur Toolkit
‣ Lemur, C++ (IR + TextMining)
• The Bow Toolkit
‣ Bow, C (Language Modeling)
• DeepDive Inference Engine
‣ dp, Scala (+ SQL & Python)
Practicals :: Setup
• Install Python, Numpy, SciPy,
matplotlib, pandas, and
IPython
‣ Via graphical installer:

http://continuum.io/downloads
• uses Continuum Analytics’ “Anaconda Python 2.0.x”,
anaconda [for Py2.7, recommended] or anaconda3
[for Py3.4; if you are progressive & “know thy snake”]
‣ Via command line: manual installation of
above packages for Py2.7 or 3.4
• http://fnl.es/installing-a-full-stack-python-data-analysis-environment-on-osx.html
…but you’re on your own here!
• Install NLTK 2.x
‣ Natural Language Toolkit

http://www.nltk.org/install.html
• Via Anaconda (Py2.7 only): conda install nltk
• Default Python (Py2.7 only): pip install nltk
‣ or download 3-alpha (for Py3.4):
• http://www.nltk.org/nltk3-alpha
• Run in directory: python setup.py install
• Install SciKit-Learn 0.x
‣ http://scikit-learn.org/stable/install.html
• Via Anaconda: conda install scikit-learn
• Default Python: pip install scikit-learn
• [Install gensim (Py2.7 only)]
‣ http://radimrehurek.com/gensim
• Anaconda & Default Python: pip install gensim
Introduction to IPython, NLTK, NumPy, and SciPy
Ladies and Gentlemen, please start your engines!
Chatty Chatterbots
Create two chat bots with NLTK and let them talk to each other,
printing each other’s answers on the screen.
http://www.nltk.org/api/nltk.chat.html
from nltk.chat import eliza; eliza.demo()
# IPython only: "??" shows the source of the eliza module
eliza??
from nltk.chat.util import Chat, reflections
from nltk.chat.eliza import pairs as eliza_pairs
eliza = Chat(eliza_pairs, reflections)
# IPython only: "?" shows the documentation of the respond method
eliza.respond?
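One possible sketch of the exercise, not a reference solution. The loop in `converse` only assumes that each bot has a `respond(text)` method, as `nltk.chat.util.Chat` does. `make_nltk_bots` is a hypothetical helper that imports NLTK lazily; it assumes `nltk.chat.rude` exposes a `pairs` table like `nltk.chat.eliza` does, which should be checked against the installed NLTK version.

```python
def converse(bot_a, bot_b, opener, turns=6):
    """Let two bots answer each other in turn, printing each reply.

    Returns the transcript as a list of (speaker, reply) pairs.
    """
    speakers = [("A", bot_a), ("B", bot_b)]
    transcript = []
    message = opener
    for turn in range(turns):
        name, bot = speakers[turn % 2]
        # respond() may return None if no pattern matches; fall back to "".
        message = bot.respond(message) or ""
        transcript.append((name, message))
        print("%s: %s" % (name, message))
    return transcript


def make_nltk_bots():
    """Build two NLTK chat bots (assumes NLTK is installed)."""
    from nltk.chat.util import Chat, reflections
    from nltk.chat.eliza import pairs as eliza_pairs
    from nltk.chat.rude import pairs as rude_pairs  # assumed to exist, see lead-in
    return Chat(eliza_pairs, reflections), Chat(rude_pairs, reflections)


# Example usage (uncomment once NLTK is installed):
# eliza_bot, rude_bot = make_nltk_bots()
# converse(eliza_bot, rude_bot, "Hello, how are you?")
```

Keeping the conversation loop independent of NLTK makes it easy to swap in any other pair of bots.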
Isaac Asimov, ~1980 (?)
“I do not fear computers. I fear the lack of them.”

More Related Content

What's hot

Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational SemanticsMarina Santini
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in PracticeVsevolod Dyomkin
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Vsevolod Dyomkin
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language ProcessingVsevolod Dyomkin
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role LabelingMarina Santini
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimEdgar Marca
 
AINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, NikolenkoAINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, NikolenkoLidia Pivovarova
 
A statistical approach to machine translation
A statistical approach to machine translationA statistical approach to machine translation
A statistical approach to machine translationHiroshi Matsumoto
 
natural language processing
natural language processing natural language processing
natural language processing sunanthakrishnan
 

What's hot (14)

Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
AINL 2016: Maraev
AINL 2016: MaraevAINL 2016: Maraev
AINL 2016: Maraev
 
Esa act
Esa actEsa act
Esa act
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensim
 
AINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, NikolenkoAINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, Nikolenko
 
A statistical approach to machine translation
A statistical approach to machine translationA statistical approach to machine translation
A statistical approach to machine translation
 
natural language processing
natural language processing natural language processing
natural language processing
 
AINL 2016: Kravchenko
AINL 2016: KravchenkoAINL 2016: Kravchenko
AINL 2016: Kravchenko
 
Practical NLP with Lisp
Practical NLP with LispPractical NLP with Lisp
Practical NLP with Lisp
 

Viewers also liked

Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text MiningMichel Bruley
 
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Frank Oellien
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text MiningMinha Hwang
 
Text and text stream mining tutorial
Text and text stream mining tutorialText and text stream mining tutorial
Text and text stream mining tutorialmgrcar
 
Text & Data Mining Licensing Issues
Text & Data Mining Licensing IssuesText & Data Mining Licensing Issues
Text & Data Mining Licensing IssuesDaniel Dollar
 
Information Extraction
Information ExtractionInformation Extraction
Information Extractionbutest
 
Text mining full text for molecular targets
Text mining full text for molecular targetsText mining full text for molecular targets
Text mining full text for molecular targetsAnn-Marie Roche
 
Aplicação de text mining
Aplicação de text miningAplicação de text mining
Aplicação de text miningJosias Oliveira
 
Text mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi MohammadzadehText mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi MohammadzadehHadi Mohammadzadeh
 
تحلیل احساسات شبکه اجتماعی متن کاوی نظرکاوی حامد عزیزی تهران جنوب
تحلیل احساسات شبکه اجتماعی متن کاوی نظرکاوی حامد عزیزی تهران جنوبتحلیل احساسات شبکه اجتماعی متن کاوی نظرکاوی حامد عزیزی تهران جنوب
تحلیل احساسات شبکه اجتماعی متن کاوی نظرکاوی حامد عزیزی تهران جنوبHamed Azizi
 
Presentación Guadalajara #Tecnopoliticay15M
Presentación Guadalajara #Tecnopoliticay15MPresentación Guadalajara #Tecnopoliticay15M
Presentación Guadalajara #Tecnopoliticay15MJavier Toret Medina
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extractionguest0edcaf
 
Text Analytics Presentation
Text Analytics PresentationText Analytics Presentation
Text Analytics PresentationSkylar Ritchie
 
Text Mining
Text MiningText Mining
Text Miningdp6
 

Viewers also liked (20)

Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
Text mining
Text miningText mining
Text mining
 
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Text and text stream mining tutorial
Text and text stream mining tutorialText and text stream mining tutorial
Text and text stream mining tutorial
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Text & Data Mining Licensing Issues
Text & Data Mining Licensing IssuesText & Data Mining Licensing Issues
Text & Data Mining Licensing Issues
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Week12
Week12Week12
Week12
 
Text mining full text for molecular targets
Text mining full text for molecular targetsText mining full text for molecular targets
Text mining full text for molecular targets
 
Campus Party2010
Campus Party2010Campus Party2010
Campus Party2010
 
Aplicação de text mining
Aplicação de text miningAplicação de text mining
Aplicação de text mining
 
J15 45 peset_fernanda
J15 45 peset_fernandaJ15 45 peset_fernanda
J15 45 peset_fernanda
 
Text mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi MohammadzadehText mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi Mohammadzadeh
 
تحلیل احساسات شبکه اجتماعی متن کاوی نظرکاوی حامد عزیزی تهران جنوب
تحلیل احساسات شبکه اجتماعی متن کاوی نظرکاوی حامد عزیزی تهران جنوبتحلیل احساسات شبکه اجتماعی متن کاوی نظرکاوی حامد عزیزی تهران جنوب
تحلیل احساسات شبکه اجتماعی متن کاوی نظرکاوی حامد عزیزی تهران جنوب
 
Presentación Guadalajara #Tecnopoliticay15M
Presentación Guadalajara #Tecnopoliticay15MPresentación Guadalajara #Tecnopoliticay15M
Presentación Guadalajara #Tecnopoliticay15M
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extraction
 
Text Mining - Data Mining
Text Mining - Data MiningText Mining - Data Mining
Text Mining - Data Mining
 
Text Analytics Presentation
Text Analytics PresentationText Analytics Presentation
Text Analytics Presentation
 
Text Mining
Text MiningText Mining
Text Mining
 

Recently uploaded

1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 

Recently uploaded (20)

1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 

OUTDATED Text Mining 1/5: Introduction

  • 1. Text Mining 1 Introduction ! 9th Madrid Summer School (2014) on Advanced Statistics and Data Mining ! Florian Leitner florian.leitner@upm.es License:
  • 2. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining “Text Mining” or “Text Analytics” The discovery of {new or existing} facts by applying natural language processing (“NLP”) & statistical learning techniques. 3 Machine Learning Inferential Statistics Computat. Linguistics RulesModels Predictions
  • 3. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Language Understanding = Artificial Intelligence ? “Her” Movie, 2013 “Watson” & “CRUSH” IBM’s future bet: Mainframes & AI “The Singularity” Ray Kurzweil
 (Google’s director of engineering) ! 4 “predict crimes before they happen” Criminal Reduction Utilizing Statistical History (IBM, reality) ! Precogs (Minority Report, movie) if? when? cognitive computing: “processing information more like a human than a machine” GoogleGoogle
  • 4. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Examples of Language Processing Applications Spam filtering Document Classification Date/time event detection Information Extraction (Web) Search engines Information Retrieval Watson in Jeopardy! (IBM) Question Answering Twitter brand monitoring Sentiment Analysis (Stat. NLP) Siri (Apple) and Google Now Language Understanding Spelling Correction Statistical Language Modeling Website translation (Google) Machine Translation “Clippy” Assistant (Microsoft) Dialog System Finding similar items (Amazon) Recommender System 5 TextMining LanguageProcessing
  • 5. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Current Topics in Text Mining • Language Modeling • String Processing • Text Classification • Information Extraction
 • Information Retrieval • Question Answering • Dialogue Systems • Text Summarization • Machine Translation • Language Understanding 6 Course requirements… Basic Linear Algebra and Probability Theory; Computer Savvy You will learn about… Other topics…
  • 6. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Words, Tokens, Shingles, and N-Grams 7 Text with words Tokens 2-Shingles 3-Shingles a.k.a. k-Shingling This is a sentence . This is is a a sentence sentence . This is a is a sentence a sentence . This is a sentence. { { { { { { { NB: “tokenization” Character-based, Regular Expressions, Probabilistic, … all trigrams of “sentence”:
 [sen, ent, nte, ten, enc, nce] Token N-Grams Beware: the terms “k-shingle” and “n-gram” are not used consistently… Character N-Grams
  • 7. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Lemmatization, Part-of-Speech (PoS) Tagging, and Named Entity Recognition (NER) 8 Token Lemma PoS NER Constitutive constitutive JJ O binding binding NN O to to TO O the the DT O peri-! peri-kappa NN B-DNA B B NN I-DNA site site NN I-DNA is be VBZ O seen see VBN O in in IN O monocytes monocyte NNS B-cell . . . O PoS Tagset: Penn Treebank B-I-O NER Tagging
  • 8. Information Retrieval (IR)
      Text Vectorization: Inverted Index
      Text 1: He that not wills to the end neither wills to the means.
      Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.
      tokens    Text 1  Text 2
      end       1       0
      go        0       2
      he        1       0
      if        0       1
      means     1       0
      Moses     0       2
      mountain  0       2
      must      0       1
      not       1       1
      that      1       0
      the       2       2
      then      0       1
      to        2       2
      will      2       1
      Comparing Word Vectors: Cosine Similarity
      Similarity(T1, T2) := cos(T1, T2)
      [Figure: Text1 and Text2 drawn as vectors over the axes count(Word1), count(Word2), and count(Word3), with angles α, β, γ]
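As a worked example of the cosine similarity defined on this slide, the following sketch compares the two count vectors from the inverted index table. The vectors are copied from the table; the function name `cosine_similarity` and the zero-vector convention are assumptions for illustration.

```python
import math

# Rows of the inverted index, in the table's token order:
# end, go, he, if, means, Moses, mountain, must, not, that, the, then, to, will
text1 = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 2, 0, 2, 2]
text2 = [0, 2, 0, 1, 0, 2, 2, 1, 1, 0, 2, 1, 2, 1]


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


similarity = cosine_similarity(text1, text2)
print(round(similarity, 2))  # about 0.53 for these two texts
```

Only the tokens shared by both texts (not, the, to, will) contribute to the dot product, which is why two texts with few common words score close to zero.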
  • 9. Document Classification
      Supervised (“Learning to Classify”, e.g., spam filtering) vs. Unsupervised (“Exploratory Grouping”, e.g., topic modeling)
      [Figure: two scatter plots of documents over the 1st and 2nd Principal Components, annotated with “document distance”, “Centroid”, and “Cluster”]
  • 10. Inverted (I-) Indices
      factors, normalization (len[text]), probabilities, and n-grams
      Text 1: He that not wills to the end neither wills to the means.
      Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.
      tokens    Text 1  Text 2
      end       1       0
      go        0       2
      he        1       0
      if        0       1
      means     1       0
      Moses     0       2
      mountain  0       2
      must      0       1
      not       1       1
      that      1       0
      the       2       2
      then      0       1
      to        2       2
      will      2       1
      unigrams  T1  T2  p(T1)  p(T2)
      end       1   0   0.09   0.00
      go        0   2   0.00   0.13
      he        1   0   0.09   0.00
      if        0   1   0.00   0.07
      means     1   0   0.09   0.00
      Moses     0   2   0.00   0.13
      mountain  0   2   0.00   0.13
      must      0   1   0.00   0.07
      not       1   1   0.09   0.07
      that      1   0   0.09   0.00
      the       2   2   0.18   0.13
      then      0   1   0.00   0.07
      to        2   2   0.18   0.13
      will      2   1   0.18   0.07
      SUM       11  15  1.00   1.00
      bigrams         Text 1  Text 2
      end, neither    1       0
      go, to          0       2
      he, that        1       0
      if, the         0       1
      Moses, must     0       1
      Moses, then     0       1
      mountain, will  0       1
      must, go        0       1
      not, go         0       1
      not, will       1       0
      that, not       1       0
      the, means      1       0
      the, mountain   0       2
      then, Moses     0       1
      to, Moses       0       1
      to, the         2       1
      will, not       0       1
      will, to        2       0
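The normalization and n-gram steps named on this slide can be sketched as follows for Text 2. The token list is assumed to be already normalized the way the table shows it (punctuation dropped, “If” lowercased, “Moses” kept capitalized); the variable names are illustrative only.

```python
from collections import Counter

# Text 2, pre-tokenized as in the table (an assumption about the preprocessing).
text2_tokens = ("if the mountain will not go to Moses "
                "then Moses must go to the mountain").split()

# Unigram counts and length-normalized probabilities: p(token) = count / len(text)
unigrams = Counter(text2_tokens)
total = sum(unigrams.values())
probabilities = {token: count / total for token, count in unigrams.items()}

# Bigram counts: pair each token with its successor.
bigrams = Counter(zip(text2_tokens, text2_tokens[1:]))

print(total)
print(round(probabilities["the"], 2), round(probabilities["if"], 2))
print(bigrams[("go", "to")], bigrams[("to", "the")])
```

Dividing by the text length is what makes texts of different sizes comparable; the bigram index grows much faster than the unigram index, which is one source of the sparsity discussed on the following slides.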
  • 11. I-Indices and the Central Dogma of Machine Learning
      y = h_θ(X)   [Diagram: y = X × θ]
      y: Rank, Class, Expectation, Probability, Descriptor*, …
      X: I-Index (transposed); rows are “texts” (n), i.e., instances, observations; columns are n-grams (p), i.e., variables, features
      θ: [Parameters]
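A minimal sketch of y = h_θ(X), assuming the simplest possible hypothesis, a linear one (y = X × θ). The slide does not commit to a particular form of h, and the parameter values below are made up purely for illustration; the feature counts are taken from the inverted index of the two example texts.

```python
# X: transposed inverted index; one row per text, one column per feature.
# Columns (from the example index): not, the, to, will
X = [
    [1, 2, 2, 2],  # Text 1
    [1, 2, 2, 1],  # Text 2
]

# theta: one hypothetical parameter per feature (made-up values).
theta = [0.5, -0.25, 0.0, 1.0]


def h(theta, X):
    """Linear hypothesis (an assumption): one score per row of X."""
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]


y = h(theta, X)
print(y)
```

Depending on the task, each value in y would then be read as a rank, a class score, an expectation, or (after a suitable transformation) a probability.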
  • 12. The Curse of Dimensionality (RE Bellman, 1961) [inventor of dynamic programming]
      • p ≫ n (more tokens/n-grams/features than texts/documents)
      • Inverted indices (X) are very sparse matrices.
      • Even with millions of training examples, unseen tokens will keep coming up in the “test set” or in production.
      • In a high-dimensional hypercube, most instances are closer to the face of the cube (“nothing”) than to their nearest neighbor.
      ✓ Remedy: the “blessing of non-uniformity”
      ➡ dimensionality reduction (a.k.a. [low-dimensional] embedding)
        ‣ feature extraction: PCA, LDA, factor analysis, unsupervised classification of tokens based on their surrounding tokens (“word embedding”), …
        ‣ feature “reduction”: locality-sensitive hashing, random projections, …
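The hypercube claim above can be made tangible with a small calculation. The sketch covers only one half of the statement: how quickly uniformly distributed points end up near some face as the number of dimensions grows. It says nothing about nearest-neighbor distances. The function name and the margin of 0.05 are arbitrary choices for the example.

```python
def fraction_near_face(dimensions, margin=0.05):
    """Share of a uniformly filled unit hypercube lying within `margin`
    of at least one face: 1 - (1 - 2 * margin) ** dimensions."""
    return 1.0 - (1.0 - 2.0 * margin) ** dimensions


for d in (1, 2, 10, 100, 1000):
    print(d, round(fraction_near_face(d), 4))
```

With thousands of n-gram features, essentially every instance sits near the boundary of the feature space, which is why the remedies listed above reduce the number of dimensions before learning.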
  • 13. Sentiment Analysis http://www.sentiment140.com Feelings are complex and not black or white… (irony, negations)
  • 14. Information Extraction (IE)
      “from unstructured (?) text to structured data”
      NB: Information Retrieval (IR) ≠ IE
      “from non-normalized text to connected data”
      input (text) is structured; output is structured, too!
      Image Source: www.DARPA.mil
  • 15. Language Understanding
      disambiguation!
      ‣ Coreference (Anaphora) Resolution
      ‣ Named Entity Recognition
      ‣ Parsing
      Apple Siri
  • 16. Text Summarization …is hard because…
      ‣ Variance/human agreement: When is a summary “correct”?
      ‣ Coherence: providing discourse structure (text flow) to the summary.
      ‣ Paraphrasing: important sentences are repeated, but with different wordings.
      ‣ Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)
      ‣ Anaphora (coreference) resolution: very hard, but crucial.
      Image Source: www.lexalytics.com
  • 17. Question Answering
      Biggest issue: very domain specific
      IBM; WolframAlpha
      Category: Oscar Winning Movies
      Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
      Answer: The Silence of the Lambs
  • 18. Machine Translation
      Languages…
      ‣ have no gender (en: the) or use different genders (es/de: el/die; la/der; ??/das).
      ‣ have different verb placements (es⬌de).
      ‣ have a different concept of verbs (Latin, Arabic, CJK).
      ‣ use different tenses (en⬌de).
      ‣ have different word orders (Latin, Arabic, CJK).
  • 19. Ambiguity
      ‣ Anaphora Resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
      ‣ Part-of-Speech Tagging: The robot wheels out the iron.
      ‣ Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
      ‣ Named Entity Recognition: Is Princeton really good for you?
      It’s all in the semantics! (Or is it?)
  • 20. The Conditional Probability for Dependent Events
      Joint Probability: P(X ∩ Y) = P(X, Y) = P(X × Y)
      The multiplication principle for dependent events*: P(X ∩ Y) = P(Y) × P(X | Y)
      and therefore, by using a little algebra:
      Conditional Probability: P(X | Y) = P(X ∩ Y) ÷ P(Y)
      *Independence: P(X ∩ Y) = P(X) × P(Y); P(X | Y) = P(X); P(Y | X) = P(Y)
      [Figure: diagram of the events X and Y]
  • 21. Marginal, Conditional and Joint Probabilities
      contingency table (X, Y: variable/factor; M: margin)
                X=x1             X=x2             M
      Y=y1      a/n = P(x1, y1)  b/n = P(x2, y1)  (a+b)/n = P(y1)
      Y=y2      c/n = P(x1, y2)  d/n = P(x2, y2)  (c+d)/n = P(y2)
      M         (a+c)/n = P(x1)  (b+d)/n = P(x2)  ∑ / n = 1 = P(X) = P(Y)
      Joint Probability*: P(xi, yj) = P(xi) × P(yj)   (*for independent events)
      Marginal Probability: P(yi)
      Conditional Probability: P(xi | yj) = P(xi, yj) ÷ P(yj)
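The quantities in the contingency table can be computed directly from its four cell counts. In this sketch the counts a, b, c, d are invented for illustration, and the helper names are hypothetical; exact fractions are used so the results can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical cell counts of a 2x2 contingency table (made-up numbers):
#            X=x1   X=x2
#   Y=y1      a      b
#   Y=y2      c      d
a, b, c, d = 30, 10, 20, 40
n = a + b + c + d

joint = {
    ("x1", "y1"): Fraction(a, n), ("x2", "y1"): Fraction(b, n),
    ("x1", "y2"): Fraction(c, n), ("x2", "y2"): Fraction(d, n),
}


def marginal_x(x):
    """P(x): sum the joint probabilities over all values of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """P(y): sum the joint probabilities over all values of X."""
    return sum(p for (_, yj), p in joint.items() if yj == y)


def conditional_x_given_y(x, y):
    """P(x | y) = P(x, y) / P(y)"""
    return joint[(x, y)] / marginal_y(y)


print(marginal_x("x1"), marginal_y("y1"))
print(conditional_x_given_y("x1", "y1"))
# Independence would require P(x, y) == P(x) * P(y) in every cell:
print(joint[("x1", "y1")] == marginal_x("x1") * marginal_y("y1"))
```

For these made-up counts the product of the marginals does not match the joint probability, so X and Y are dependent and the conditional probability differs from the marginal.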
  • 22. Bayes’ Rule: Diachronic Interpretation
      H - Hypothesis; D - Data
      P(H|D) = P(H) × P(D|H) ÷ P(D)
      posterior: P(H|D); prior: P(H); likelihood: P(D|H); “normalizing constant” (law of total probability): P(D)
  • 23. Bayes’ Rule: The Monty Hall Problem
      P(H|D) = P(H) × P(D|H) ÷ P(D)
      Door 1 is your pick. Given the car is behind H=1, Monty Hall opens D=(2 or 3); H=2 → D=3; H=3 → D=2. Here, Monty opens door 3.
      Door (H)                       1                2              3
      Prior p(H)                     1/3              1/3            1/3
      Likelihood p(D|H)              1/2              1              0
      p(H) × p(D|H)                  1/3×1/2 = 1/6    1/3×1 = 1/3    1/3×0 = 0
      p(D) = ∑ p(H) × p(D|H)         1/6 + 1/3 = 1/2
      Posterior p(H|D)               1/6÷1/2 = 1/3    1/3÷1/2 = 2/3  0÷1/2 = 0
      practical use: a trickster hides a stone with three cups…
      Images Source: Wikipedia, Monty Hall Problem, Cepheus
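The Monty Hall table can be reproduced with a generic Bayes update. The sketch assumes the usual setup: you picked door 1 and Monty opened door 3, so the likelihoods are the probabilities of him opening door 3 under each hypothesis about where the car is. The function name `posterior` is an illustrative choice.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' rule for a discrete set of hypotheses:
    P(H|D) = P(H) * P(D|H) / P(D), with P(D) = sum over H of P(H) * P(D|H)."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # P(D), the normalizing constant
    return {h: p / evidence for h, p in unnormalized.items()}


# Hypotheses: the car is behind door 1, 2, or 3 (door 1 is your pick).
priors = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
# Data: Monty opens door 3.
likelihoods = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}

result = posterior(priors, likelihoods)
for door, probability in result.items():
    print(door, probability)
```

Switching to door 2 wins with probability 2/3, matching the posterior row of the table.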
  • 24. An Overview of Open Source NLP Frameworks
      • Natural Language ToolKit
        ‣ NLTK, Python
      • General Architecture for Text Engineering
        ‣ GATE, Java
      • Stanford NLP Framework
        ‣ CoreNLP, Java
      • Unstructured Information Management Architecture
        ‣ UIMA, Java
        ‣ Many framework-sized sub-projects, e.g., ClearNLP
      • LingPipe Framework
        ‣ LingPipe, Java (open source, but only free for “non-commercial” use)
      • FreeLing NLP Suite
        ‣ FreeLing, C++
      • The Lemur Toolkit
        ‣ Lemur, C++ (IR + Text Mining)
      • The Bow Toolkit
        ‣ Bow, C (Language Modeling)
      • DeepDive Inference Engine
        ‣ dp, Scala (+ SQL & Python)
  • 25. Practicals :: Setup
      • Install Python, NumPy, SciPy, matplotlib, pandas, and IPython
        ‣ Via graphical installer: http://continuum.io/downloads
          • uses Continuum Analytics’ “Anaconda Python 2.0.x”, anaconda [for Py2.7, recommended] or anaconda3 [for Py3.4; if you are progressive & “know thy snake”]
        ‣ Via command line: manual installation of above packages for Py2.7 or 3.4
          • http://fnl.es/installing-a-full-stack-python-data-analysis-environment-on-osx.html …but you’re on your own here!
      • Install NLTK 2.x
        ‣ Natural Language Toolkit http://www.nltk.org/install.html
          • Via Anaconda (Py2.7 only): conda install nltk
          • Default Python (Py2.7 only): pip install nltk
        ‣ or download 3-alpha (for Py3.4):
          • http://www.nltk.org/nltk3-alpha
          • Run in directory: python setup.py install
      • Install SciKit-Learn 0.x
        ‣ http://scikit-learn.org/stable/install.html
          • Via Anaconda: conda install scikit-learn
          • Default Python: pip install scikit-learn
      • [Install gensim (Py2.7 only)]
        ‣ http://radimrehurek.com/gensim
          • Anaconda & Default Python: pip install gensim
  • 26. Introduction to IPython, NLTK, NumPy, and SciPy Ladies and Gentlemen, please start your engines!
  • 27. Chatty Chatterbots
      Create two chat bots with NLTK and let them talk to each other, printing each other’s answers on the screen.
      http://www.nltk.org/api/nltk.chat.html
      In an IPython session:
        from nltk.chat import eliza; eliza.demo()   # try the built-in Eliza demo
        eliza??                                     # IPython: show the module source
        from nltk.chat.util import Chat, reflections
        from nltk.chat.eliza import pairs as eliza_pairs
        eliza = Chat(eliza_pairs, reflections)      # build a bot from Eliza's patterns
        eliza.respond?                              # IPython: show help for respond()
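One possible shape for the conversation loop is sketched below. To stay runnable without NLTK, it uses a toy `TemplateBot` class as a stand-in; both `TemplateBot` and `converse` are hypothetical names invented for this example. Any object with a `respond(text)` method returning a string should fit, which is the interface the slide inspects via `eliza.respond?`.

```python
class TemplateBot:
    """Toy stand-in for a chat bot (not part of NLTK)."""

    def __init__(self, template):
        self.template = template

    def respond(self, text):
        # Drop trailing punctuation, then wrap the input in the bot's template.
        return self.template.format(text.rstrip(".?!"))


def converse(first, second, opening, turns=4):
    """Let two (name, bot) pairs answer each other, starting from `opening`."""
    speakers = [first, second]
    transcript = []
    message = opening
    for turn in range(turns):
        name, bot = speakers[turn % 2]
        message = bot.respond(message)  # each answer becomes the next input
        transcript.append((name, message))
    return transcript


alice = ("Alice", TemplateBot("Why do you say: {}?"))
bob = ("Bob", TemplateBot("I heard that {}."))

for name, line in converse(alice, bob, "Hello there!"):
    print(f"{name}: {line}")
```

For the actual exercise, the two toy bots would be replaced by `Chat` instances built from NLTK's pattern pairs, as in the session above.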
  • 28. Isaac Asimov, ~1980 (?) “I do not fear computers. I fear the lack of them.”