Overview of text mining and NLP (+software)

1. Text mining and natural language processing
Florian Leitner
Technical University of Madrid (UPM), Spain
Tyba, Madrid, ES, 12th of June, 2015
2. Florian Leitner
Is language understanding & generation key to artificial intelligence?
• “Her” (Samantha), movie, 2013
• “The Singularity: ~2030”, Ray Kurzweil, Google’s director of engineering
• “Watson” & “CRUSH”, IBM’s bet on the future: data streams, mainframes & AI
CRUSH (Criminal Reduction Utilizing Statistical History; IBM, reality) promises to “predict crimes before they happen”, much like the Precogs (Minority Report, movie). If? When?
Cognitive computing: “processing information more like a human than a machine”
3. Florian Leitner
Examples of text mining and natural language processing applications.
• Spam filtering
• Document classification
• Social media/brand monitoring
• Opinion mining (& text classification)
• Search engines
• Information retrieval
• Plagiarism detection
• Content-based recommendation systems
• Watson (Jeopardy!, IBM)
• Question answering
• Spelling correction
• Language modeling
• Website translation (Google)
• Machine translation
• Digital assistants (MS’ Clippy)
• Dialog systems (“Turing test”)
• Siri (Apple) and Google Now
• Speech recognition & language understanding
• Event detection (in e-mails)
• Information extraction
[These applications span a spectrum from text mining to language processing.]
Relevant FOSS (only!) libraries will be shown down here… (MIT, ALv2, GPL, BSD, …)
5. Florian Leitner
Document and text classification/clustering
[Figure: documents plotted along the 1st and 2nd principal components; similar documents lie at small distances and group into clusters around centroids.]
Supervised (“learning to classify from examples”, e.g., spam filtering)
vs.
Unsupervised (“exploratory grouping”, e.g., topic modeling)
Software: LIBSVM
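The supervised case can be sketched with a toy nearest-centroid classifier over 2-D document vectors (e.g., the first two principal components). All numbers and labels below are made up for illustration:

```python
def centroid(vectors):
    # component-wise mean of a cluster's document vectors
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest(v, centroids):
    # assign v to the class whose centroid is closest (squared Euclidean distance)
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: d2(v, centroids[label]))

# toy 2-D document vectors (e.g., two principal components)
spam = [[5.0, 1.0], [6.0, 0.5], [5.5, 1.5]]
ham = [[1.0, 4.0], [0.5, 5.0], [1.5, 4.5]]
centroids = {"spam": centroid(spam), "ham": centroid(ham)}
print(nearest([5.2, 1.1], centroids))  # spam
```

The unsupervised variant would instead discover the centroids itself, e.g. by iterating the same two steps (assign to nearest centroid, recompute centroids) as in k-means.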
6. Florian Leitner
Words, Tokens, and N-Grams/Shingles
Tokenization: “This is a sentence.” → This | is | a | sentence | .
NB: “tokenization”. Splitting can be character-based, via regular expressions, probabilistic, …
Each resulting unit is a token (or shingle).
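A regular-expression-based splitter can be sketched in a few lines of Python (the pattern below is illustrative, not the tokenizer of any particular library):

```python
import re

def tokenize(text):
    """Split text into word tokens and punctuation marks."""
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # punctuation character, so "sentence." yields ["sentence", "."].
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("This is a sentence."))  # ['This', 'is', 'a', 'sentence', '.']
```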
7. Florian Leitner
Words, Tokens, and N-Grams/Shingles (cont.)
Token n-grams (“k-shingling”) over “This is a sentence .”:
• shingles (unigrams): This | is | a | sentence | .
• 2-shingles (bigrams): This is | is a | a sentence | sentence .
• 3-shingles (trigrams): This is a | is a sentence | a sentence .
Character n-grams, e.g. all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]
Snag: the terms “shingle”, “token” and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
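Both kinds of n-grams are straightforward to generate with a sliding window; a minimal sketch (function names are my own):

```python
def token_ngrams(tokens, n):
    # k-shingling over tokens: every contiguous run of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    # character n-grams of a single word
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = ["This", "is", "a", "sentence", "."]
print(token_ngrams(tokens, 2))
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
print(char_ngrams("sentence", 3))
# ['sen', 'ent', 'nte', 'ten', 'enc', 'nce']
```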
8. Florian Leitner
Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER)

Token        Lemma        PoS  NER
Constitutive constitutive JJ   O
binding      binding      NN   O
to           to           TO   O
the          the          DT   O
peri-κ       peri-kappa   NN   B-DNA
B            B            NN   I-DNA
site         site         NN   I-DNA
is           be           VBZ  O
seen         see          VBN  O
in           in           IN   O
monocytes    monocyte     NNS  B-cell
.            .            .    O

Penn Treebank: the de facto standard PoS tagset {NN, JJ, DT, VBZ, …}
B-I-O (Begin-Inside-Outside) chunk encoding marks each (relevant) token of a chunk; common alternatives: I-O, I-E-O, B-I-E-W-O (E = end token, W = single-token/unigram word).
Linguistic annotations of tokens (used to train automated classifiers).
Software: Stanford CoreNLP, FACTORIE, FreeLing, and many more…
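Decoding B-I-O tags back into chunks can be sketched as follows (a minimal illustrative decoder, not any particular library's implementation):

```python
def bio_chunks(tokens, tags):
    """Collect (chunk_text, chunk_type) spans from B-I-O encoded tags."""
    chunks, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a Begin tag closes any open chunk and starts a new one
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            # an Inside tag extends the open chunk
            current.append(token)
        else:
            # an Outside tag closes any open chunk
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        chunks.append((" ".join(current), ctype))
    return chunks

tokens = "Constitutive binding to the peri-kappa B site is seen in monocytes .".split()
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA", "O", "O", "O", "B-cell", "O"]
print(bio_chunks(tokens, tags))
# [('peri-kappa B site', 'DNA'), ('monocytes', 'cell')]
```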
9. Florian Leitner
Word vectors and inverted indices
[Figure: two texts plotted as vectors of token counts (count(Word1), count(Word2), count(Word3)); the angles α, β, γ between the vectors illustrate their similarity.]
Comparing text vectors, e.g. by cosine similarity: Similarity(T1, T2) := cos(T1, T2)
Text vectorization: the inverted index (“search engine basics”); each token/word is a dimension!
Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.

tokens    Text 1  Text 2
end       1       0
go        0       2
he        1       0
if        0       1
means     1       0
Moses     0       2
mountain  0       2
must      0       1
not       1       1
that      1       0
the       2       2
then      0       1
to        2       2
will      2       1

Software: INDRI
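The vectorization and cosine comparison can be sketched with the standard library alone. Note this counts raw lowercased whitespace tokens, so unlike the table above it does not merge “wills” into the lemma “will”:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors (mappings)."""
    # dot product only over the tokens the two texts share
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

t1 = Counter("he that not wills to the end neither wills to the means".split())
t2 = Counter("if the mountain will not go to moses then moses must go to the mountain".split())
print(round(cosine(t1, t2), 3))  # 0.424
```

The two texts share only “not”, “to”, and “the”, so their similarity is moderate; identical texts score 1.0.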
10. Florian Leitner
Inverted indices and the central dogma of machine learning
y = h_θ(X)
[Diagram: the (transposed) inverted index X, with n “texts” as rows (instances, observations) and p n-grams as columns (variables, features), is multiplied by the parameter vector θ (one parameter per feature) to yield y: a rank, class, expectation, probability, descriptor*, …]
(Hyperparameters are settings that control the learning algorithm.)
11. Florian Leitner
Inverted indices and the central dogma of machine learning (cont.)
As before, y = h_θ(X), but “nonparametric” models store parameters per instance rather than per feature.
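The parametric case above can be made concrete as a dot product of count vectors with one weight per feature. A toy illustration with made-up numbers (a linear h_θ; real systems would also add a bias and a link function such as the logistic):

```python
# Each row of X is one text's n-gram count vector (p = 3 features);
# theta holds one parameter per feature; h_theta maps a row to a score.
X = [
    [1, 0, 2],  # text 1: counts of the p n-grams
    [0, 3, 1],  # text 2
]
theta = [0.5, -1.0, 2.0]  # one parameter per feature

def h(x, theta):
    # linear hypothesis: the dot product of features and parameters
    return sum(xi * ti for xi, ti in zip(x, theta))

y = [h(row, theta) for row in X]
print(y)  # [4.5, -1.0]
```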
12. Florian Leitner
The curse of dimensionality
(R.E. Bellman, 1961) [inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)
• Inverted indices (X) are (discrete) sparse matrices.
• Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.
‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, the outside) than to any other instance.
✓ Remedy: (feature) dimensionality reduction, the “blessing of non-uniformity”.
• Feature extraction (compression): PCA/LSA (projection), factor analysis (regression), auto-encoders & deep learning (compression & embedding), …
• Feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
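The distance concentration behind the curse is easy to simulate: as the dimension grows, the nearest and farthest neighbors of a point become almost equally far away. A small sketch (the sample sizes are arbitrary):

```python
import random

random.seed(42)

def min_max_ratio(dim, n=200):
    """Ratio of nearest to farthest distance from the origin
    for n random points in the unit hypercube of a given dimension.
    A ratio near 1 means distances carry almost no contrast."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    dists = [sum(x * x for x in p) ** 0.5 for p in pts]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    # the ratio climbs toward 1 as the dimension grows
    print(dim, round(min_max_ratio(dim), 2))
```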
14. Florian Leitner
Google’s review summaries: Opinion mining (“sentiment” analysis).
Don’t do it, please… ;-) (If you must: see the document and text classification software.)
15. Florian Leitner
Polarity of sentiment keywords in IMDB.
[Figure: polarity of sentiment keywords such as “not good” across IMDB review ratings.]
Christopher Potts. On the negativity of negation. 2011.
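A simple trick related to this observation is negation marking, common in sentiment analysis: append a suffix to every token between a negator and the next punctuation mark, so that “good” and “good_NEG” become distinct features. A sketch (the negator set and punctuation pattern are illustrative, not from any specific system):

```python
import re

NEGATORS = {"not", "no", "never", "n't"}  # a minimal, illustrative set

def mark_negation(tokens):
    """Append _NEG to tokens in the scope of a negator,
    ending the scope at the next punctuation mark."""
    out, negated = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            negated = False  # punctuation closes the negation scope
            out.append(tok)
        elif tok.lower() in NEGATORS:
            negated = True  # the negator itself is kept unmarked
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negated else tok)
    return out

print(mark_negation("the movie was not good at all .".split()))
# ['the', 'movie', 'was', 'not', 'good_NEG', 'at_NEG', 'all_NEG', '.']
```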
16. Florian Leitner
Language understanding: Parsing and semantic analysis.
Named Entity Recognition, Coreference (Anaphora) Resolution, and Entity Grounding: each requires disambiguation!
Example system: Apple Siri
Grammar theory: L. Tesnière (dependency), N. Chomsky (constituency)
Software: Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
17. Florian Leitner
Automatic text summarization… is very difficult, because:
• Variance/human agreement: when is a summary “correct”?
• Coherence: providing discourse structure (text flow) to the summary.
• Paraphrasing: important sentences are repeated, but with different wordings.
• Implied messages: “the Dow Jones index rose 10 points” → “the economy is thriving”.
• Anaphora (coreference) resolution: very hard, but crucial.
Image source: www.lexalytics.com
Software: Lex[Page]Rank (JUNG), sumy, TextTeaser (whose author got hired by Google…)
18. Florian Leitner
Machine translation: Deep learning with auto-encoders.
Different languages…
‣ have only one gender (en) or use opposing genders (es vs. de: el/die, la/der, …/das).
‣ have different verb placements (es⬌de).
‣ have different concepts of verbs (Latin, Arabic, CJK).
‣ use different tenses (en⬌de).
‣ have different word orders (Latin, Arabic, CJK).
Software: DL4J
19. Florian Leitner
Question answering: The champions league of TM & NLP.
Biggest issue: statistical inference (cognitive computing). All men are mortal. Socrates probably is a man… …therefore, Socrates might be mortal.
Category: Oscar Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
Answer: The Silence of the Lambs
Software: IBM Watson, WolframAlpha
20. Florian Leitner
Information extraction: Knowledge mining for molecular biology.
[Diagram: a pipeline from articles to biological repositories: Article Classification (biological model articles) → Named Entity Recognition → Entity Mapping (Grounding), e.g. “Cdk5” in rat mapped to TaxID 10116 and UniProt Q03114 → Relationship Extraction (entity associations, binary interactions, experimental methods) → Relationship Annotations → Biological Repositories; supported by ontologies & thesauri and the WWW, and enabling short factoid question answering.]
Software: MITIE, OpenDMAP, ClearTK
21. Florian Leitner
Text mining and language processing is all about resolving ambiguities.
• Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
• Part-of-Speech tagging: The robot wheels out the iron.
• Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
• Entity recognition & grounding: Is Princeton really good for you?