Text mining and natural
language processing
Florian Leitner

Technical University of Madrid (UPM), Spain

Tyba

Madrid, ES, 12th of June, 2015
License:
Is language understanding & generation

key to artificial intelligence?
• “Her” (Samantha) Movie, 2013

• “The Singularity: ~2030”

Ray Kurzweil, Google’s director of engineering

• “Watson” & “CRUSH”

IBM’s bet on the future: Datastreams, Mainframes & AI
“Predict crimes before they happen”: CRUSH, Criminal Reduction Utilizing Statistical History (IBM, reality), vs. the Precogs (Minority Report, movie). If? When?
Cognitive computing: “processing information more like a human than a machine”
Examples of text mining and

natural language processing applications.
• Spam filtering

• Document classification

• Social media/brand monitoring

• Opinion mining (& text classification)

• Search engines

• Information retrieval

• Plagiarism detection

• Content-based recommendation systems

• Watson (Jeopardy!, IBM)

• Question answering

• Spelling correction

• Language modeling

• Website translation (Google)

• Machine translation

• Digital assistants (MS’ Clippy)

• Dialog systems (“Turing test”)

• Siri (Apple) and Google Now

• Speech recognition & language understanding

• Event detection (in e-mails)

• Information extraction
[Axis label alongside the list: Text Mining → Language Processing]
Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
Concepts & Terminology
Document and text

classification/clustering
[Figure: documents plotted against the 1st and 2nd principal components; document distance, clusters, and centroids are marked.]
Supervised (“Learning to classify from examples”, e.g., spam filtering)

vs.

Unsupervised (“Exploratory grouping”, e.g., topic modeling)
LIBSVM
Words, Tokens,
and N-Grams/Shingles
Splitting (NB: “tokenization”): character-based, regular expressions, probabilistic, …
Token N-Grams of “This is a sentence.”:
• tokens or shingles (unigrams): This | is | a | sentence | .
• 2-shingles (bigrams): This is | is a | a sentence | sentence .
• 3-shingles (trigrams): This is a | is a sentence | a sentence .
Character N-Grams (“k-shingling”), e.g. all trigrams of the word “sentence”:
[sen, ent, nte, ten, enc, nce]
Snag: the terms “shingle”, “token” and “n-gram” are not used consistently…
but “n-gram” and “token” are far more common!
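A minimal Python sketch of both kinds of n-grams from this slide. The function names `token_ngrams` and `char_ngrams` are made up for illustration, and whitespace splitting stands in for a real tokenizer (an assumption that only holds because the example sentence already separates the final period).

```python
def token_ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, k):
    """Return all contiguous character k-grams (k-shingles) of a string."""
    return [word[i:i + k] for i in range(len(word) - k + 1)]


# Whitespace splitting is the crudest possible tokenizer; it is enough here
# because the example sentence already has a space before the period.
tokens = "This is a sentence .".split()

bigrams = token_ngrams(tokens, 2)
trigrams = token_ngrams(tokens, 3)
shingles = char_ngrams("sentence", 3)

print(bigrams)
print(trigrams)
print(shingles)
```

A sequence of m tokens yields m - n + 1 n-grams, so the five tokens above give four bigrams and three trigrams.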
Lemmatization, Part-of-Speech (PoS) tagging, and
Named Entity Recognition (NER)
Token         Lemma         PoS  NER
Constitutive  constitutive  JJ   O
binding       binding       NN   O
to            to            TO   O
the           the           DT   O
peri-κ        peri-kappa    NN   B-DNA
B             B             NN   I-DNA
site          site          NN   I-DNA
is            be            VBZ  O
seen          see           VBN  O
in            in            IN   O
monocytes     monocyte      NNS  B-cell
.             .             .    O
Linguistic annotations of tokens (used to train automated classifiers).
• PoS: Penn Treebank, the de facto standard PoS tagset {NN, JJ, DT, VBZ, …}
• NER: B-I-O chunk encoding (Begin-Inside-Outside of a chunk of relevant tokens)
• common alternatives: I-O, I-E-O, B-I-E-W-O (E: End token; W: (unigram) Word)
Stanford CoreNLP, FreeLing, FACTORIE, and many more…
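To make the B-I-O encoding concrete, here is a small Python sketch that turns the NER column of the table above back into chunks. The helper name `bio_chunks` is hypothetical, and the kappa token is spelled as its lemma (peri-kappa) to keep the example plain ASCII.

```python
def bio_chunks(tokens, tags):
    """Group tokens into (type, text) chunks according to their B-I-O tags."""
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # "B-DNA" -> ("B", "-", "DNA"); "O" -> ("O", "", "")
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new chunk begins (an I tag without a matching open chunk
            # is treated like a B tag); close any chunk still open.
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": outside of any chunk, so close the open chunk, if any.
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks


tokens = ["Constitutive", "binding", "to", "the", "peri-kappa", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA",
        "O", "O", "O", "B-cell", "O"]

chunks = bio_chunks(tokens, tags)
print(chunks)
```

Note that "B-cell" here is the tag B with the chunk type "cell", not the biological term.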
Word vectors and inverted indices
[Figure: Text1 and Text2 drawn as vectors in a space with the axes count(Word1), count(Word2), and count(Word3), with the angles α, β, and γ.]
Similarity(T1, T2) := cos(T1, T2)
Comparing text vectors:

E.g., cosine similarity
Text vectorization:

Inverted index
Text 1: He that not wills to the end neither

wills to the means.

Text 2: If the mountain will not go to Moses,

then Moses must go to the mountain.
tokens    Text 1  Text 2
end       1       0
go        0       2
he        1       0
if        0       1
means     1       0
Moses     0       2
mountain  0       2
must      0       1
not       1       1
that      1       0
the       2       2
then      0       1
to        2       2
will      2       1
“Search engine basics”
Each token/word is a dimension!
INDRI
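The two example texts can be vectorized and compared with a few lines of Python. This is only a sketch under stated assumptions: the `LEMMAS` dictionary is a hypothetical stand-in for a real lemmatizer (it merges "wills" into "will", as the table does), tokens are lowercased, and every token is counted, including "neither", which the table does not list.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a real lemmatizer: it maps only the one
# inflected form needed to reproduce the merged "will" row of the table.
LEMMAS = {"wills": "will"}


def count_vector(text):
    """Lowercase, split on non-letters, lemmatize, and count the tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(LEMMAS.get(token, token) for token in tokens)


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


text1 = "He that not wills to the end neither wills to the means."
text2 = ("If the mountain will not go to Moses, "
         "then Moses must go to the mountain.")

t1, t2 = count_vector(text1), count_vector(text2)
similarity = cosine(t1, t2)

print(sorted(t1.keys() | t2.keys()))  # the shared vocabulary (dimensions)
print(round(similarity, 3))
```

Only tokens that occur in both texts ("not", "the", "to", "will") contribute to the dot product, which is why sparse dictionaries are sufficient and no dense matrix is needed.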
Inverted indices and

the central dogma of machine learning
y = h_θ(X)
[Figure: y = X^T × θ. X^T is the transposed inverted index, with n rows (“texts”: instances, observations) and p columns (n-grams: variables, features); y is a rank, class, expectation, probability, descriptor*, …]
Parameters (θ): per feature; “nonparametric”: per instance
(Hyperparameters are settings that control the learning algorithm.)
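As one possible instance of y = h_θ(X), the following Python sketch uses a logistic hypothesis that turns each row of token counts into a probability. The choice of a logistic function and the values in `theta` are assumptions made for illustration only; the counts are those of "go", "not", "the", and "will" in the two example texts from the inverted index slide.

```python
import math


def h(theta, X):
    """Logistic hypothesis: one probability per row (text) of X."""
    return [
        1.0 / (1.0 + math.exp(-sum(t * x for t, x in zip(theta, row))))
        for row in X
    ]


# Rows are texts (n = 2); columns are the counts of the features
# "go", "not", "the", "will" (p = 4) in the two example texts.
X = [
    [0, 1, 2, 2],  # Text 1
    [2, 1, 2, 1],  # Text 2
]

# One parameter per feature. These values are made up for illustration;
# a learning algorithm would normally estimate them from labeled examples.
theta = [1.5, 0.0, -0.5, -1.0]

y = h(theta, X)
print(y)
```

Real collections have far more features than this toy matrix, which is where the next slide picks up.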
The curse of dimensionality

(R.E. Bellman, 1961) [inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)

• Inverted indices (X) are (discrete) sparse matrices.

• Even with millions of training examples, unseen tokens will keep
popping up during evaluation or in production.

‣ In such a high-dimensional hypercube, most instances are closer to
the face of the cube (“nothing”, outside) than to other instances.

✓ Remedy: (feature) dimensionality reduction

The “blessing of non-uniformity.”

• feature extraction (compression): PCA/LSA (projection), factor analysis (regression),
compression, auto-encoders & deep learning (compression & embedding), …

• feature selection (elimination): LASSO (regularization), SVM (support vectors),
Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
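The hypercube claim above can be checked with a small simulation. This Python sketch assumes uniformly distributed points in the unit hypercube, which real text data is not (hence the “blessing of non-uniformity”); the function name and the sample sizes are arbitrary choices for illustration. It needs Python 3.8 or later for `math.dist`.

```python
import math
import random


def fraction_closer_to_face(n_points, n_dims, seed=0):
    """Share of uniform random points in the unit hypercube that lie closer
    to the surface of the cube than to their nearest neighboring point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    closer = 0
    for i, p in enumerate(points):
        # Distance to the nearest face: smallest gap to 0 or 1 on any axis.
        to_face = min(min(x, 1.0 - x) for x in p)
        # Euclidean distance to the nearest other point.
        to_neighbor = min(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        closer += to_face < to_neighbor
    return closer / n_points


low_dim = fraction_closer_to_face(200, 1)
high_dim = fraction_closer_to_face(200, 100)
print(low_dim, high_dim)
```

With one dimension only a few points near the two ends are closer to the boundary than to a neighbor; with a hundred dimensions practically every point is.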
Applications
Google’s review summaries:

Opinion mining (“sentiment” analysis).
Don’t do it, please… ;-) (If you must: see document and text classification software.)
Polarity of sentiment keywords in IMDB.
Christopher Potts. On the negativity of negation. 2011
“not good”
Language understanding:
Parsing and semantic analysis.
• Coreference (anaphora) resolution: disambiguation!
• Named entity recognition: disambiguation!
• Entity grounding: disambiguation!
L. Tesnière, N. Chomsky
Apple Siri
Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
Automatic text summarization:
…is very difficult because…
• Variance/human agreement: When is a summary “correct”?
• Coherence: providing discourse structure (text flow) to the summary.
• Paraphrasing: important sentences are repeated, but with different wordings.
• Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)
• Anaphora (coreference) resolution: very hard, but crucial.
Image Source: www.lexalytics.com
Lex[Page]Rank (JUNG), sumy, TextTeaser
the author got hired by Google…
Machine translation:
Deep learning with auto-encoders.
Different languages…
‣ have only one gender (en) or use opposing genders (es vs. de: el/die [icon]; la/der [icon]; …/das [icon])
‣ have different verb placements (es ⬌ de).
‣ have different concepts of verbs (Latin, Arabic, CJK).
‣ use different tenses (en ⬌ de).
‣ have different word orders (Latin, Arabic, CJK).
DL4J
Question answering:
The champions league of TM & NLP.
Biggest issue: statistical inference
IBM Watson, WolframAlpha
Category: Oscar Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
Answer: The Silence of the Lambs
All men are mortal. Socrates probably is a man… Therefore, Socrates might be mortal.
(cognitive computing)
Information extraction:
Knowledge mining for molecular biology.
[Figure: text mining workflow for molecular biology. Labels: Articles, WWW, Biological Repositories, Ontologies & Thesauri; Article Classification, Named Entity Recognition, Entity Mapping (Grounding), Relationship Extraction, Short Factoid Question Answering; Entity Associations, Binary Interactions, Relationship Annotations, Experimental Methods, Biological Model. Grounding example: Cdk5, Rat: TaxID 10116, UniProt Q03114.]
MITIE OpenDMAP ClearTK
Text mining and language processing
is all about resolving ambiguities.
Anaphora resolution
Carl and Bob were fighting:
“You should shut up,”
Carl told him.
Part-of-Speech tagging
The robot wheels out the iron.
Paraphrasing
Unemployment is on the rise.
vs
The economy is slumping.
Entity recognition & grounding
Is Princeton really good for you?
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 

Overview of text mining and NLP (+software)

  • 1. Text mining and natural language processing. Florian Leitner, Technical University of Madrid (UPM), Spain. Tyba, Madrid, ES, 12th of June, 2015.
  • 2. Is language understanding & generation key to artificial intelligence?
 • “Her” (Samantha): movie, 2013
 • “The Singularity: ~2030”: Ray Kurzweil, Google’s director of engineering
 • “Watson” & “CRUSH”: IBM’s bet on the future (datastreams, mainframes & AI)
 “Predict crimes before they happen”: Criminal Reduction Utilizing Statistical History (CRUSH; IBM, reality) vs. Precogs (Minority Report, movie). If? When?
 Cognitive computing: “processing information more like a human than a machine”
  • 3. Examples of text mining and natural language processing applications.
 Axis: Text Mining → Language Processing
 • Spam filtering: document classification
 • Social media/brand monitoring: opinion mining (& text classification)
 • Search engines: information retrieval
 • Plagiarism detection: content-based recommendation systems
 • Watson (Jeopardy!, IBM): question answering
 • Spelling correction: language modeling
 • Website translation (Google): machine translation
 • Digital assistants (MS’ Clippy): dialog systems (“Turing test”)
 • Siri (Apple) and Google Now: speech recognition & language understanding
 • Event detection (in e-mails): information extraction
 Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
  • 5. Document and text classification/clustering
 [Figure: documents plotted along the 1st and 2nd principal components; one panel shows the distance between documents, the other a cluster with its centroid.]
 Supervised (“Learning to classify from examples”, e.g., spam filtering) vs. unsupervised (“Exploratory grouping”, e.g., topic modeling)
 FOSS libraries: LIBSVM
  • 7. Words, Tokens, and N-Grams/Shingles
 Sentence: This is a sentence.
 Token n-grams (each unit is a token or shingle):
 shingles (unigrams): This | is | a | sentence | .
 2-shingles (bigrams): This is | is a | a sentence | sentence .
 3-shingles (trigrams): This is a | is a sentence | a sentence .
 Character n-grams (“k-shingling”), e.g., all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]
 NB: “tokenization” is the splitting step: character-based, regular expressions, probabilistic, …
 Snag: the terms “shingle”, “token” and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
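The token and character n-grams on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `ngrams` is made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(items, n):
    """Return all contiguous n-grams (n-shingles) of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Whitespace splitting is a stand-in for a real tokenizer (assumption).
tokens = "This is a sentence .".split()

token_bigrams = ngrams(tokens, 2)    # 2-shingles over tokens
token_trigrams = ngrams(tokens, 3)   # 3-shingles over tokens

# Character trigrams ("k-shingling" with k = 3) of a single word.
char_trigrams = ["".join(gram) for gram in ngrams("sentence", 3)]

print(token_bigrams)
print(token_trigrams)
print(char_trigrams)
```

The same function serves both cases because a string is just a sequence of characters, which is one reason the terms "shingle" and "n-gram" get mixed up so easily.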
  • 8. Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER)
 Linguistic annotations of tokens (used to train automated classifiers):
 Token         Lemma         PoS  NER
 Constitutive  constitutive  JJ   O
 binding       binding       NN   O
 to            to            TO   O
 the           the           DT   O
 peri-κ        peri-kappa    NN   B-DNA
 B             B             NN   I-DNA
 site          site          NN   I-DNA
 is            be            VBZ  O
 seen          see           VBN  O
 in            in            IN   O
 monocytes     monocyte      NNS  B-cell
 .             .             .    O
 PoS tagset {NN, JJ, DT, VBZ, …}: Penn Treebank, the de facto standard.
 NER column: B-I-O chunk encoding, i.e., Begin-Inside-Outside of a (relevant) token chunk; “peri-κ B site” forms one chunk.
 Common alternatives: I-O, I-E-O, B-I-E-W-O (E = End token, W = (unigram) Word).
 FOSS libraries: Stanford CoreNLP, FACTORIE, FreeLing and many more…
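To make the B-I-O encoding concrete, the following sketch groups the tagged tokens from the slide back into labeled chunks. The function name `bio_chunks` is hypothetical, the token "peri-kappa" is spelled as in the lemma column, and treating a stray I- tag (one without a preceding matching chunk) as "outside" is a design choice of this sketch, not something the slide prescribes.

```python
def bio_chunks(tokens, tags):
    """Group B-I-O tagged tokens into (label, text) chunks."""
    chunks = []
    current = None  # (label, [tokens]) of the chunk being built, if any
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A Begin tag always closes the open chunk and starts a new one.
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An Inside tag extends the open chunk of the same label.
            current[1].append(token)
        else:
            # "O" (or a stray I- tag) closes any open chunk.
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

tokens = ["Constitutive", "binding", "to", "the", "peri-kappa", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA",
        "O", "O", "O", "B-cell", "O"]

chunks = bio_chunks(tokens, tags)
print(chunks)
```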
  • 9. Word vectors and inverted indices
 Text vectorization: inverted index (“search engine basics”); each token/word is a dimension!
 Text 1: He that not wills to the end neither wills to the means.
 Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.
 tokens    Text 1  Text 2
 end       1       0
 go        0       2
 he        1       0
 if        0       1
 means     1       0
 Moses     0       2
 mountain  0       2
 must      0       1
 not       1       1
 that      1       0
 the       2       2
 then      0       1
 to        2       2
 will      2       1
 Comparing text vectors, e.g., with cosine similarity: Similarity(T1, T2) := cos(T1, T2)
 [Figure: Text1 and Text2 drawn as vectors in a space with axes count(Word1), count(Word2), and count(Word3); angles α, β, γ.]
 FOSS libraries: INDRI
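As a rough illustration of the cosine similarity between the two example texts, the sketch below copies the token counts from the slide's table into two dictionaries (tokens with a count of zero are simply omitted). The function name `cosine_similarity` is chosen here for illustration.

```python
import math

# Token counts copied from the table on the slide (zero counts omitted).
text1 = {"end": 1, "he": 1, "means": 1, "not": 1, "that": 1,
         "the": 2, "to": 2, "will": 2}
text2 = {"go": 2, "if": 1, "Moses": 2, "mountain": 2, "must": 1,
         "not": 1, "the": 2, "then": 1, "to": 2, "will": 1}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (dicts)."""
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty texts (assumption of this sketch)
    return dot / (norm_a * norm_b)

similarity = cosine_similarity(text1, text2)
print(round(similarity, 3))
```

Only the shared tokens (not, the, to, will) contribute to the dot product, which is 11 here; the vector norms are √17 and 5, so the similarity is 11 / (5 · √17), roughly 0.53.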
  • 11. Inverted indices and the central dogma of machine learning
 y = h_θ(X), drawn on the slide as y = Xᵀ × θ
 y: rank, class, expectation, probability, descriptor*, …
 Xᵀ: inverted index (transposed); the “texts” (n) are the instances or observations, the n-grams (p) are the variables or features
 θ: parameters, one per feature (“nonparametric”: per instance)
 (Hyperparameters are settings that control the learning algorithm.)
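One way to read y = h_θ(X) is as a plain linear model: each text (a row of the transposed inverted index) is multiplied with one parameter per feature. The sketch below assumes exactly that; the feature subset is taken from the two example texts on the previous slide, while the function name `h` and the weights in `theta` are invented for illustration and are not learned from anything.

```python
def h(theta, X):
    """Linear hypothesis: one output per text, y_i = sum_j X[i][j] * theta[j]."""
    return [sum(x_ij * t_j for x_ij, t_j in zip(row, theta)) for row in X]

# Rows are texts (instances), columns are token counts (features).
features = ["not", "the", "to", "will"]
X = [
    [1, 2, 2, 2],  # Text 1
    [1, 2, 2, 1],  # Text 2
]

# Hypothetical parameters, one per feature; a learning algorithm would fit these.
theta = [0.5, 0.0, 0.0, 1.0]

y = h(theta, X)
print(y)
```

With these weights the frequent function words "the" and "to" are ignored, so the two outputs differ only through the count of "will".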
  • 12. The curse of dimensionality (R.E. Bellman, 1961) [inventor of dynamic programming]
 • p ≫ n (far more tokens/features than texts/instances)
 • Inverted indices (X) are (discrete) sparse matrices.
 • Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.
 ‣ In such a high-dimensional hypercube, most instances are closer to the face of the cube (“nothing”, outside) than to other instances.
 ✓ Remedy: (feature) dimensionality reduction. The “blessing of non-uniformity.”
 • feature extraction (compression): PCA/LSA (projection), factor analysis (regression), compression, auto-encoders & deep learning (compression & embedding), …
 • feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
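The hypercube remark can be made tangible with a small calculation. Assuming points are spread uniformly over a unit hypercube (an assumption made for this sketch, not a claim about real text data), a point lies within a given margin of some face unless every coordinate stays inside the inner interval, so the fraction near a face is 1 − (1 − 2·margin)^d. The function name and the margin of 0.05 are illustrative; the sketch covers only the "close to a face" half of the statement, not the distances between instances.

```python
def fraction_near_face(dimensions, margin=0.05):
    """Fraction of a unit hypercube's volume within `margin` of any face.

    A uniformly drawn point avoids the margin only if all of its
    coordinates fall into [margin, 1 - margin].
    """
    return 1.0 - (1.0 - 2.0 * margin) ** dimensions

for d in (1, 2, 10, 100, 1000):
    print(d, round(fraction_near_face(d), 4))
```

Already at a few hundred dimensions practically all of the volume sits in the thin shell next to the faces, and text data typically has far more token features than that.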
  • 14. Google’s review summaries: Opinion mining (“sentiment” analysis).
 Don’t do it, please… ;-) (If you must: see document and text classification software.)
  • 15. Polarity of sentiment keywords in IMDB.
 [Figure label: “not good”]
 Christopher Potts. On the negativity of negation. 2011
  • 16. Language understanding: Parsing and semantic analysis.
 • Named Entity Recognition: disambiguation!
 • Entity Grounding: disambiguation!
 • Coreference (Anaphora) Resolution: disambiguation!
 L. Tesnière; N. Chomsky
 Examples and libraries: Apple Siri, Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift and many more…
  • 17. Automatic text summarization …is very difficult because…
 • Variance/human agreement: When is a summary “correct”?
 • Coherence: providing discourse structure (text flow) to the summary.
 • Paraphrasing: important sentences are repeated, but with different wordings.
 • Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)
 • Anaphora (coreference) resolution: very hard, but crucial.
 Image source: www.lexalytics.com
 FOSS libraries: Lex[Page]Rank (JUNG), sumy, TextTeaser; the author got hired by Google…
  • 18. Machine translation: Deep learning with auto-encoders.
 Different languages…
 ‣ have only one gender (en) or use opposing genders (es vs. de: el/die; la/der; …/das).
 ‣ have different verb placements (es⬌de).
 ‣ have different concepts of verbs (latin, arab, cjk).
 ‣ use different tenses (en⬌de).
 ‣ have different word orders (latin, arab, cjk).
 FOSS libraries: DL4J
  • 19. Question answering: The champions league of TM & NLP.
 Category: Oscar Winning Movies
 Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
 Answer: The Silence of the Lambs
 Biggest issue: statistical inference (cognitive computing). All men are mortal. Socrates probably is a man… …Therefore, Socrates might be mortal.
 Systems: IBM Watson, WolframAlpha
  • 20. Information extraction: Knowledge mining for molecular biology.
 [Diagram components: Articles; WWW; Ontologies & Thesauri; Biological Repositories; Article Classification; Named Entity Recognition; Entity Mapping (Grounding); Entity Associations; Relationship Extraction; Relationship Annotations; Binary Interactions; Experimental Methods; Biological Model; Short Factoid Question Answering]
 Grounding example: Cdk5, Rat: TaxID 10116, UniProt Q03114
 FOSS libraries: MITIE, OpenDMAP, ClearTK
  • 21. Text mining and language processing is all about resolving ambiguities.
 • Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
 • Part-of-Speech tagging: The robot wheels out the iron.
 • Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
 • Entity recognition & grounding: Is Princeton really good for you?