Overview of text mining and NLP (+software)

1. Text mining and natural language processing
Florian Leitner
Technical University of Madrid (UPM), Spain
Tyba, Madrid, ES, 12th of June, 2015
2. Florian Leitner
Is language understanding & generation key to artificial intelligence?
• “Her” (Samantha), movie, 2013
• “The Singularity: ~2030”, Ray Kurzweil, Google’s director of engineering
• “Watson” & “CRUSH”, IBM’s bet on the future: data streams, mainframes & AI
CRUSH (Criminal Reduction Utilizing Statistical History; IBM, reality) promises to “predict crimes before they happen”, much like the Precogs (Minority Report, movie). If? When?
Cognitive computing: “processing information more like a human than a machine”
3. Florian Leitner
Examples of text mining and natural language processing applications.
• Spam filtering
• Document classification
• Social media/brand monitoring
• Opinion mining (& text classification)
• Search engines
• Information retrieval
• Plagiarism detection
• Content-based recommendation systems
• Watson (Jeopardy!, IBM)
• Question answering
• Spelling correction
• Language modeling
• Website translation (Google)
• Machine translation
• Digital assistants (MS’ Clippy)
• Dialog systems (“Turing test”)
• Siri (Apple) and Google Now
• Speech recognition & language understanding
• Event detection (in e-mails)
• Information extraction
[These applications span a spectrum from text mining to language processing.]
Relevant FOSS (only!) libraries will be shown down here… (MIT, ALv2, GPL, BSD, …)
5. Florian Leitner
Document and text classification/clustering
[Figure: documents plotted along the 1st and 2nd principal components; similar documents lie at small distances and group into clusters around centroids.]
Supervised (“learning to classify from examples”, e.g., spam filtering)
vs.
Unsupervised (“exploratory grouping”, e.g., topic modeling)
Software: LIBSVM
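The supervised case can be sketched with a toy nearest-centroid classifier over 2-D document vectors (e.g., the first two principal components). All numbers and labels below are made up for illustration:

```python
def centroid(vectors):
    # component-wise mean of a cluster's document vectors
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest(v, centroids):
    # assign v to the class whose centroid is closest (squared Euclidean distance)
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: d2(v, centroids[label]))

# toy 2-D document vectors (e.g., two principal components)
spam = [[5.0, 1.0], [6.0, 0.5], [5.5, 1.5]]
ham = [[1.0, 4.0], [0.5, 5.0], [1.5, 4.5]]
centroids = {"spam": centroid(spam), "ham": centroid(ham)}
print(nearest([5.2, 1.1], centroids))  # spam
```

The unsupervised variant would instead discover the centroids itself, e.g. by iterating the same two steps (assign to nearest centroid, recompute centroids) as in k-means.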
6. Florian Leitner
Words, Tokens, and N-Grams/Shingles
Tokenization: “This is a sentence.” → This | is | a | sentence | .
NB: “tokenization”. Splitting can be character-based, via regular expressions, probabilistic, …
Each resulting unit is a token (or shingle).
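A regular-expression-based splitter can be sketched in a few lines of Python (the pattern below is illustrative, not the tokenizer of any particular library):

```python
import re

def tokenize(text):
    """Split text into word tokens and punctuation marks."""
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # punctuation character, so "sentence." yields ["sentence", "."].
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("This is a sentence."))  # ['This', 'is', 'a', 'sentence', '.']
```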
7. Florian Leitner
Words, Tokens, and N-Grams/Shingles (cont.)
Token n-grams (“k-shingling”) over “This is a sentence .”:
• shingles (unigrams): This | is | a | sentence | .
• 2-shingles (bigrams): This is | is a | a sentence | sentence .
• 3-shingles (trigrams): This is a | is a sentence | a sentence .
Character n-grams, e.g. all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]
Snag: the terms “shingle”, “token” and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
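Both kinds of n-grams are straightforward to generate with a sliding window; a minimal sketch (function names are my own):

```python
def token_ngrams(tokens, n):
    # k-shingling over tokens: every contiguous run of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    # character n-grams of a single word
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = ["This", "is", "a", "sentence", "."]
print(token_ngrams(tokens, 2))
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
print(char_ngrams("sentence", 3))
# ['sen', 'ent', 'nte', 'ten', 'enc', 'nce']
```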
8. Florian Leitner
Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER)

Token        Lemma        PoS  NER
Constitutive constitutive JJ   O
binding      binding      NN   O
to           to           TO   O
the          the          DT   O
peri-κ       peri-kappa   NN   B-DNA
B            B            NN   I-DNA
site         site         NN   I-DNA
is           be           VBZ  O
seen         see          VBN  O
in           in           IN   O
monocytes    monocyte     NNS  B-cell
.            .            .    O

Penn Treebank: the de facto standard PoS tagset {NN, JJ, DT, VBZ, …}
B-I-O (Begin-Inside-Outside) chunk encoding marks each (relevant) token of a chunk; common alternatives: I-O, I-E-O, B-I-E-W-O (E = end token, W = single-token/unigram word).
Linguistic annotations of tokens (used to train automated classifiers).
Software: Stanford CoreNLP, FACTORIE, FreeLing, and many more…
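Decoding B-I-O tags back into chunks can be sketched as follows (a minimal illustrative decoder, not any particular library's implementation):

```python
def bio_chunks(tokens, tags):
    """Collect (chunk_text, chunk_type) spans from B-I-O encoded tags."""
    chunks, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a Begin tag closes any open chunk and starts a new one
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            # an Inside tag extends the open chunk
            current.append(token)
        else:
            # an Outside tag closes any open chunk
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        chunks.append((" ".join(current), ctype))
    return chunks

tokens = "Constitutive binding to the peri-kappa B site is seen in monocytes .".split()
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA", "O", "O", "O", "B-cell", "O"]
print(bio_chunks(tokens, tags))
# [('peri-kappa B site', 'DNA'), ('monocytes', 'cell')]
```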
9. Florian Leitner
Word vectors and inverted indices
[Figure: two texts plotted as vectors of token counts (count(Word1), count(Word2), count(Word3)); the angles α, β, γ between the vectors illustrate their similarity.]
Comparing text vectors, e.g. by cosine similarity: Similarity(T1, T2) := cos(T1, T2)
Text vectorization: the inverted index (“search engine basics”); each token/word is a dimension!
Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.

tokens    Text 1  Text 2
end       1       0
go        0       2
he        1       0
if        0       1
means     1       0
Moses     0       2
mountain  0       2
must      0       1
not       1       1
that      1       0
the       2       2
then      0       1
to        2       2
will      2       1

Software: INDRI
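The vectorization and cosine comparison can be sketched with the standard library alone. Note this counts raw lowercased whitespace tokens, so unlike the table above it does not merge “wills” into the lemma “will”:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors (mappings)."""
    # dot product only over the tokens the two texts share
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

t1 = Counter("he that not wills to the end neither wills to the means".split())
t2 = Counter("if the mountain will not go to moses then moses must go to the mountain".split())
print(round(cosine(t1, t2), 3))  # 0.424
```

The two texts share only “not”, “to”, and “the”, so their similarity is moderate; identical texts score 1.0.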
10. Florian Leitner
Inverted indices and the central dogma of machine learning
y = h_θ(X)
[Diagram: the (transposed) inverted index X, with n “texts” as rows (instances, observations) and p n-grams as columns (variables, features), is multiplied by the parameter vector θ (one parameter per feature) to yield y: a rank, class, expectation, probability, descriptor*, …]
(Hyperparameters are settings that control the learning algorithm.)
11. Florian Leitner
Inverted indices and the central dogma of machine learning (cont.)
As before, y = h_θ(X), but “nonparametric” models store parameters per instance rather than per feature.
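The parametric case above can be made concrete as a dot product of count vectors with one weight per feature. A toy illustration with made-up numbers (a linear h_θ; real systems would also add a bias and a link function such as the logistic):

```python
# Each row of X is one text's n-gram count vector (p = 3 features);
# theta holds one parameter per feature; h_theta maps a row to a score.
X = [
    [1, 0, 2],  # text 1: counts of the p n-grams
    [0, 3, 1],  # text 2
]
theta = [0.5, -1.0, 2.0]  # one parameter per feature

def h(x, theta):
    # linear hypothesis: the dot product of features and parameters
    return sum(xi * ti for xi, ti in zip(x, theta))

y = [h(row, theta) for row in X]
print(y)  # [4.5, -1.0]
```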
12. Florian Leitner
The curse of dimensionality
(R.E. Bellman, 1961) [inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)
• Inverted indices (X) are (discrete) sparse matrices.
• Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.
‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, the outside) than to any other instance.
✓ Remedy: (feature) dimensionality reduction, the “blessing of non-uniformity”.
• Feature extraction (compression): PCA/LSA (projection), factor analysis (regression), auto-encoders & deep learning (compression & embedding), …
• Feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
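The distance concentration behind the curse is easy to simulate: as the dimension grows, the nearest and farthest neighbors of a point become almost equally far away. A small sketch (the sample sizes are arbitrary):

```python
import random

random.seed(42)

def min_max_ratio(dim, n=200):
    """Ratio of nearest to farthest distance from the origin
    for n random points in the unit hypercube of a given dimension.
    A ratio near 1 means distances carry almost no contrast."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    dists = [sum(x * x for x in p) ** 0.5 for p in pts]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    # the ratio climbs toward 1 as the dimension grows
    print(dim, round(min_max_ratio(dim), 2))
```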
14. Florian Leitner
Google’s review summaries: Opinion mining (“sentiment” analysis).
Don’t do it, please… ;-) (If you must: see the document and text classification software.)
15. Florian Leitner
Polarity of sentiment keywords in IMDB.
[Figure: polarity of sentiment keywords such as “not good” across IMDB review ratings.]
Christopher Potts. On the negativity of negation. 2011.
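A simple trick related to this observation is negation marking, common in sentiment analysis: append a suffix to every token between a negator and the next punctuation mark, so that “good” and “good_NEG” become distinct features. A sketch (the negator set and punctuation pattern are illustrative, not from any specific system):

```python
import re

NEGATORS = {"not", "no", "never", "n't"}  # a minimal, illustrative set

def mark_negation(tokens):
    """Append _NEG to tokens in the scope of a negator,
    ending the scope at the next punctuation mark."""
    out, negated = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            negated = False  # punctuation closes the negation scope
            out.append(tok)
        elif tok.lower() in NEGATORS:
            negated = True  # the negator itself is kept unmarked
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negated else tok)
    return out

print(mark_negation("the movie was not good at all .".split()))
# ['the', 'movie', 'was', 'not', 'good_NEG', 'at_NEG', 'all_NEG', '.']
```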
16. Florian Leitner
Language understanding: Parsing and semantic analysis.
Named Entity Recognition, Coreference (Anaphora) Resolution, and Entity Grounding: each requires disambiguation!
Example system: Apple Siri
Grammar theory: L. Tesnière (dependency), N. Chomsky (constituency)
Software: Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
17. Florian Leitner
Automatic text summarization… is very difficult, because:
• Variance/human agreement: when is a summary “correct”?
• Coherence: providing discourse structure (text flow) to the summary.
• Paraphrasing: important sentences are repeated, but with different wordings.
• Implied messages: “the Dow Jones index rose 10 points” → “the economy is thriving”.
• Anaphora (coreference) resolution: very hard, but crucial.
Image source: www.lexalytics.com
Software: Lex[Page]Rank (JUNG), sumy, TextTeaser (whose author got hired by Google…)
18. Florian Leitner
Machine translation: Deep learning with auto-encoders.
Different languages…
‣ have only one gender (en) or use opposing genders (es vs. de: el/die, la/der, …/das).
‣ have different verb placements (es⬌de).
‣ have different concepts of verbs (Latin, Arabic, CJK).
‣ use different tenses (en⬌de).
‣ have different word orders (Latin, Arabic, CJK).
Software: DL4J
19. Florian Leitner
Question answering: The champions league of TM & NLP.
Biggest issue: statistical inference (cognitive computing). All men are mortal. Socrates probably is a man… …therefore, Socrates might be mortal.
Category: Oscar Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
Answer: The Silence of the Lambs
Software: IBM Watson, WolframAlpha
20. Florian Leitner
Information extraction: Knowledge mining for molecular biology.
[Diagram: a pipeline from articles to biological repositories: Article Classification (biological model articles) → Named Entity Recognition → Entity Mapping (Grounding), e.g. “Cdk5” in rat mapped to TaxID 10116 and UniProt Q03114 → Relationship Extraction (entity associations, binary interactions, experimental methods) → Relationship Annotations → Biological Repositories; supported by ontologies & thesauri and the WWW, and enabling short factoid question answering.]
Software: MITIE, OpenDMAP, ClearTK
21. Florian Leitner
Text mining and language processing is all about resolving ambiguities.
• Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
• Part-of-Speech tagging: The robot wheels out the iron.
• Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
• Entity recognition & grounding: Is Princeton really good for you?