Text mining and natural
language processing
Florian Leitner

Technical University of Madrid (UPM), Spain

Tyba

Madrid, ES, 12th of June, 2015
License:
Florian Leitner
Is language understanding & generation

key to artificial intelligence?
• “Her” (Samantha) Movie, 2013

• “The Singularity: ~2030”

Ray Kurzweil, Google’s director of engineering

• “Watson” & “CRUSH”

IBM’s bet on the future: Datastreams, Mainframes & AI
CRUSH: Criminal Reduction Utilizing Statistical History (IBM, reality)
“predict crimes before they happen”
Precogs (Minority Report, movie)
if? when?
Cognitive computing: “processing information more like a human than a machine”
Examples of text mining and

natural language processing applications.
• Spam filtering

• Document classification

• Social media/brand monitoring

• Opinion mining (& text classification)

• Search engines

• Information retrieval

• Plagiarism detection

• Content-based recommendation systems

• Watson (Jeopardy!, IBM)

• Question answering

• Spelling correction

• Language modeling

• Website translation (Google)

• Machine translation

• Digital assistants (MS’ Clippy)

• Dialog systems (“Turing test”)

• Siri (Apple) and Google Now

• Speech recognition & language understanding

• Event detection (in e-mails)

• Information extraction
Text Mining
Language Processing
Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
Concepts & Terminology
Document and text

classification/clustering
[Figure: documents plotted over the 1st and 2nd principal components, labelled with document distance, cluster, and centroid.]
Supervised (“Learning to classify from examples”, e.g., spam filtering)

vs.

Unsupervised (“Exploratory grouping”, e.g., topic modeling)
LIBSVM
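As a rough illustration of the supervised case, the sketch below assigns a document vector to the class with the nearest centroid. The two-dimensional vectors, the labels, and all function names are made up for this example; LIBSVM itself trains support vector machines and is not used here.

```python
from math import dist  # Euclidean distance, available since Python 3.8

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_centroids(examples):
    """Group labelled vectors by class and return one centroid per class."""
    by_label = {}
    for vector, label in examples:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_label.items()}

def classify(vector, centroids):
    """Return the label whose centroid is closest to the vector."""
    return min(centroids, key=lambda label: dist(vector, centroids[label]))

# Hypothetical documents, already projected onto two principal components.
training = [((1.0, 1.0), "spam"), ((2.0, 1.0), "spam"),
            ((8.0, 9.0), "ham"), ((9.0, 8.0), "ham")]
centroids = fit_centroids(training)
label = classify((2.0, 2.0), centroids)
```

An unsupervised method such as k-means would have to discover the groups without the labels, typically by alternating between assigning points to centroids and recomputing the centroids.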
Words, Tokens,
and N-Grams/Shingles
NB: “tokenization”
Splitting: character-based, regular expressions, probabilistic, …

Text: This is a sentence.

Token n-grams (“k-shingling”); each unit is a token or shingle:
shingles (unigrams): {This} {is} {a} {sentence} {.}
2-shingles (bigrams): {This is} {is a} {a sentence} {sentence .}
3-shingles (trigrams): {This is a} {is a sentence} {a sentence .}

Character n-grams, e.g. all trigrams of the word “sentence”:
[sen, ent, nte, ten, enc, nce]

Snag: the terms “shingle”, “token” and “n-gram” are not used consistently…
but “n-gram” and “token” are far more common!
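The splitting and shingling steps on this slide can be sketched in a few lines of Python. The regular expression and the helper names `tokenize` and `ngrams` are assumptions made for this illustration, not part of any library listed in the deck.

```python
import re

def tokenize(text):
    """Regex-based splitting into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as a list of tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = tokenize("This is a sentence.")   # token unigrams
bigrams = ngrams(tokens, 2)                # token 2-shingles
trigrams = ngrams(tokens, 3)               # token 3-shingles

# Character n-grams: the same function applied to the characters of one word.
char_trigrams = ["".join(g) for g in ngrams("sentence", 3)]
```

The same `ngrams` helper serves both cases because a string is just a sequence of characters, which is one reason the terms get mixed up in practice.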
Lemmatization, Part-of-Speech (PoS) tagging, and
Named Entity Recognition (NER)
Linguistic annotations of tokens (used to train automated classifiers).

Token         Lemma         PoS  NER
Constitutive  constitutive  JJ   O
binding       binding       NN   O
to            to            TO   O
the           the           DT   O
peri-κ        peri-kappa    NN   B-DNA
B             B             NN   I-DNA
site          site          NN   I-DNA
is            be            VBZ  O
seen          see           VBN  O
in            in            IN   O
monocytes     monocyte      NNS  B-cell
.             .             .    O

PoS: de facto standard PoS tagset {NN, JJ, DT, VBZ, …}: Penn Treebank
NER: B-I-O (Begin-Inside-Outside) chunk encoding; a chunk spans the (relevant) tokens
Common alternatives: I-O, I-E-O, B-I-E-W-O (E: End token; W: (unigram) Word)

Stanford CoreNLP, FACTORIE, FreeLing, and many more…
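To make the B-I-O encoding concrete, here is a minimal sketch that groups tagged tokens back into chunks. The function name `bio_chunks` is hypothetical, and the token "peri-kappa" is written in its lemma form to keep the example ASCII-only.

```python
def bio_chunks(tokens, tags):
    """Group B-I-O tagged tokens into (chunk_type, [tokens]) pairs."""
    chunks = []
    current = None  # the chunk currently being built, if any
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk.
            current = (tag[2:], [token])
            chunks.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag of the same type continues the open chunk.
            current[1].append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk.
            current = None
    return chunks

tokens = ["Constitutive", "binding", "to", "the", "peri-kappa", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA",
        "O", "O", "O", "B-cell", "O"]
chunks = bio_chunks(tokens, tags)
```

Here `chunks` holds one DNA mention ("peri-kappa B site") and one cell mention ("monocytes"); stray I- tags are simply dropped, which is one of several possible conventions.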
Word vectors and inverted indices
[Figure: Text1 and Text2 drawn as vectors over the axes count(Word1), count(Word2), and count(Word3), each ranging from 0 to 10, with angles labelled α, β, and γ.]
Similarity(T1, T2) := cos(T1, T2)
Comparing text vectors:

E.g., cosine similarity
Text vectorization:

Inverted index
Text 1: He that not wills to the end neither

wills to the means.

Text 2: If the mountain will not go to Moses,

then Moses must go to the mountain.
tokens    Text 1  Text 2
end       1       0
go        0       2
he        1       0
if        0       1
means     1       0
Moses     0       2
mountain  0       2
must      0       1
not       1       1
that      1       0
the       2       2
then      0       1
to        2       2
will      2       1

“Search engine basics”: each token/word is a dimension!
INDRI
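The cosine similarity of the two example texts can be worked out directly from the token counts in the table. This is a plain-Python sketch; the dictionary layout and the function name `cosine` are choices made for this illustration.

```python
from math import sqrt

# Token counts copied from the table: token -> (Text 1, Text 2).
counts = {
    "end": (1, 0), "go": (0, 2), "he": (1, 0), "if": (0, 1),
    "means": (1, 0), "Moses": (0, 2), "mountain": (0, 2), "must": (0, 1),
    "not": (1, 1), "that": (1, 0), "the": (2, 2), "then": (0, 1),
    "to": (2, 2), "will": (2, 1),
}

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

text1 = [c[0] for c in counts.values()]  # one dimension per token
text2 = [c[1] for c in counts.values()]
similarity = cosine(text1, text2)        # 11 / (sqrt(17) * 5), about 0.53
```

Only the four shared tokens (not, the, to, will) contribute to the dot product, so most of the similarity here comes from function words; this is the usual motivation for stop-word removal or TF-IDF weighting.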
Inverted indices and

the central dogma of machine learning
y = h_θ(X)
[Figure: y = X^T × θ]
X^T: inverted index (transposed), with n rows (“texts”: instances, observations) and p columns (n-grams: variables, features)
θ: parameters, per feature (“nonparametric”: per instance)
y: rank, class, expectation, probability, descriptor*, …
(Hyperparameters are settings that control the learning algorithm.)
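As one possible reading of y = h_θ(X), the sketch below assumes a purely linear hypothesis, where each output is the dot product of an instance's feature counts with the parameter vector. The slide does not commit to a specific h; the matrix values and parameters are invented for the example.

```python
def predict(X, theta):
    """Linear hypothesis: y_i = sum over j of X[i][j] * theta[j]."""
    return [sum(x_ij * t_j for x_ij, t_j in zip(row, theta)) for row in X]

# n = 2 texts (rows), p = 3 n-gram features (columns); counts are made up.
X = [[1, 0, 2],
     [0, 3, 1]]
theta = [0.5, -1.0, 2.0]  # hypothetical parameters, one per feature
y = predict(X, theta)     # one output per text
```

Learning then means choosing θ so that the outputs match known targets; a classifier would additionally pass each output through a threshold or a squashing function.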
The curse of dimensionality

(R.E. Bellman, 1961) [inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)

• Inverted indices (X) are (discrete) sparse matrices.

• Even with millions of training examples, unseen tokens will keep
popping up during evaluation or in production.

‣ In such a high-dimensional hypercube, most instances are closer to
the face of the cube (“nothing”, outside) than to other instances.

✓ Remedy: (feature) dimensionality reduction

The “blessing of non-uniformity.”

• feature extraction (compression): PCA/LSA (projection), factor analysis (regression),
compression, auto-encoders & deep learning (compression & embedding), …

• feature selection (elimination): LASSO (regularization), SVM (support vectors),
Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
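The hypercube remark can be checked with a small calculation. Assuming instances spread uniformly over a unit hypercube (an idealization, since real text data is far from uniform), the share of the volume within a margin of some face is 1 - (1 - 2·margin)^d. The function name below is made up for this sketch.

```python
def fraction_near_face(dimensions, margin):
    """Share of a unit hypercube's volume lying within `margin` of some face."""
    inner_side = 1.0 - 2.0 * margin  # side length of the interior cube
    return 1.0 - inner_side ** dimensions

# With a 5% margin, almost all of the volume ends up near a face as d grows.
for d in (2, 10, 100, 1000):
    print(d, round(fraction_near_face(d, 0.05), 4))
```

In two dimensions only 19% of the area lies in that border strip, while in 100 dimensions practically everything does, which is why distances between instances become less informative without dimensionality reduction.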
Applications
Google’s review summaries:

Opinion mining (“sentiment” analysis).
Don’t do it, please… ;-) (If you must: see document and text classification software.)
Polarity of sentiment keywords in IMDB.
Christopher Potts. On the negativity of negation. 2011
“not good”
Language understanding:
Parsing and semantic analysis.
Named Entity Recognition: disambiguation!
Entity Grounding: disambiguation!
Coreference (Anaphora) Resolution: disambiguation!
Apple Siri
L. Tesnière, N. Chomsky
Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
Automatic text summarization:
…is very difficult because…

• Variance/human agreement: When is a summary “correct”?

• Coherence: providing discourse structure (text flow) to the summary.

• Paraphrasing: important sentences are repeated, but with different wordings.

• Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)

• Anaphora (coreference) resolution: very hard, but crucial.
Image Source: www.lexalytics.com
Lex[Page]Rank (JUNG) sumy TextTeaser
the author got hired by Google…
Machine translation:
Deep learning with auto-encoders.
Different languages…
‣ have only one gender (en) or use opposing genders
(es vs. de: el/die; la/der; …/das)
‣ have different verb placements (es⬌de).
‣ have different concepts of verbs (Latin, Arabic, CJK).
‣ use different tenses (en⬌de).
‣ have different word orders (Latin, Arabic, CJK).
DL4J
Question answering:
The champions league of TM & NLP.
Biggest issue: statistical inference
IBM Watson, WolframAlpha
Category: Oscar Winning Movies
Hint: Its final scene includes the line “I
do wish we could chat longer, but I’m
having an old friend for dinner”
Answer: The Silence of the Lambs
All men are mortal.

Socrates probably is a man…
…Therefore, Socrates

might be mortal.
(cognitive computing)
Information extraction:
Knowledge mining for molecular biology.
[Figure: information extraction workflow. Labels: Biological Repositories; Binary Interactions; Named Entity Recognition; Entity Associations; Entity Mapping (Grounding); Relationship Extraction; Relationship Annotations; Experimental Methods; Article Classification; Biological Model; Articles; Short Factoid Question Answering; Ontologies & Thesauri; WWW.]
Example labels in the figure: Cdk5, Rat, TaxID 10116, UniProt Q03114
MITIE OpenDMAP ClearTK
Text mining and language processing
is all about resolving ambiguities.
Anaphora resolution
Carl and Bob were fighting:
“You should shut up,”
Carl told him.
Part-of-Speech tagging
The robot wheels out the iron.
Paraphrasing
Unemployment is on the rise.
vs
The economy is slumping.
Entity recognition & grounding
Is Princeton really good for you?
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Overview of text mining and NLP (+software)

  • 1. Text mining and natural language processing. Florian Leitner, Technical University of Madrid (UPM), Spain. Tyba, Madrid, ES, 12th of June, 2015. License:
  • 2. Is language understanding & generation key to artificial intelligence?
 • “Her” (Samantha), movie, 2013
 • “The Singularity: ~2030” (Ray Kurzweil, Google’s director of engineering)
 • “Watson” & “CRUSH” (IBM’s bet on the future: datastreams, mainframes & AI)
 CRUSH, Criminal Reduction Utilizing Statistical History (IBM, reality): “predict crimes before they happen”. Compare: Precogs (Minority Report, movie).
 If? When? Cognitive computing: “processing information more like a human than a machine”.
  • 3. Examples of text mining and natural language processing applications.
 • Spam filtering
 • Document classification
 • Social media/brand monitoring
 • Opinion mining (& text classification)
 • Search engines
 • Information retrieval
 • Plagiarism detection
 • Content-based recommendation systems
 • Watson (Jeopardy!, IBM)
 • Question answering
 • Spelling correction
 • Language modeling
 • Website translation (Google)
 • Machine translation
 • Digital assistants (MS’ Clippy)
 • Dialog systems (“Turing test”)
 • Siri (Apple) and Google Now
 • Speech recognition & language understanding
 • Event detection (in e-mails)
 • Information extraction
 (Slide axis: Text Mining to Language Processing.)
 Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
  • 5. Document and text classification/clustering
 [Figure: documents plotted against the 1st and 2nd principal components, showing document distance, centroids, and clusters.]
 Supervised (“Learning to classify from examples”, e.g., spam filtering) vs. Unsupervised (“Exploratory grouping”, e.g., topic modeling)
 FOSS libraries: LIBSVM
  • 7. Words, Tokens, and N-Grams/Shingles
 Text: “This is a sentence.”
 Tokens or shingles (unigrams): This | is | a | sentence | .
 2-shingles (bigrams): This is | is a | a sentence | sentence .
 3-shingles (trigrams): This is a | is a sentence | a sentence .
 NB: “tokenization” means splitting the text: character-based, regular expressions, probabilistic, …
 Snag: the terms “shingle”, “token” and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
 Token n-grams vs. character n-grams (“k-shingling”), e.g., all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]
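The shingling on this slide can be sketched in a few lines of Python. This is a minimal illustration, not any library's API: the helper names `token_ngrams` and `char_ngrams` are made up here, and the whitespace split only works because the example sentence is already pre-tokenized.

```python
def token_ngrams(tokens, n):
    """Return all contiguous token n-grams (n-shingles) as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, n):
    """Return all contiguous character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Naive whitespace tokenization of the pre-split slide sentence (an assumption;
# real tokenizers use regular expressions or probabilistic models).
tokens = "This is a sentence .".split()

bigrams = token_ngrams(tokens, 2)
trigrams = token_ngrams(tokens, 3)
sentence_trigrams = char_ngrams("sentence", 3)

print(bigrams)
print(trigrams)
print(sentence_trigrams)
```

The character trigrams of “sentence” come out as the six strings listed on the slide.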
  • 8. Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER)
 Linguistic annotations of tokens (used to train automated classifiers).
 Token         Lemma         PoS   NER
 Constitutive  constitutive  JJ    O
 binding       binding       NN    O
 to            to            TO    O
 the           the           DT    O
 peri-κ        peri-kappa    NN    B-DNA
 B             B             NN    I-DNA
 site          site          NN    I-DNA
 is            be            VBZ   O
 seen          see           VBN   O
 in            in            IN    O
 monocytes     monocyte      NNS   B-cell
 .             .             .     O
 De facto standard PoS tagset: Penn Treebank {NN, JJ, DT, VBZ, …}
 B-I-O chunk encoding: Begin-Inside-Outside, marking the (relevant) tokens of a chunk. Common alternatives: I-O, I-E-O, B-I-E-W-O (E: End token; W: (unigram) Word).
 FOSS libraries: Stanford CoreNLP, FACTORIE, FreeLing, and many more…
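To make the B-I-O chunk encoding concrete, here is a small sketch that collects the tagged tokens from the slide's table back into chunks. The function name `bio_chunks` is hypothetical, and the garbled slide token is written as “peri-kappa” (taken from its lemma column).

```python
def bio_chunks(tokens, tags):
    """Group tokens into (label, text) chunks according to their B-I-O tags."""
    chunks = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_chunk = tag.startswith("B-")
        continues_chunk = tag.startswith("I-") and tag[2:] == current_label
        if current_tokens and not continues_chunk:
            # Any tag that does not continue the open chunk closes it.
            chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if starts_chunk:
            current_label, current_tokens = tag[2:], [token]
        elif continues_chunk:
            current_tokens.append(token)
        # "O" tags (and stray "I-" tags without an open chunk) are skipped.
    if current_tokens:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks


tokens = ["Constitutive", "binding", "to", "the", "peri-kappa", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA",
        "O", "O", "O", "B-cell", "O"]

chunks = bio_chunks(tokens, tags)
print(chunks)
```

The three DNA-tagged tokens end up in one chunk and “monocytes” forms a single-token cell chunk.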
  • 9. Word vectors and inverted indices
 “Search engine basics”
 Text vectorization: inverted index. Each token/word is a dimension!
 Text 1: He that not wills to the end neither wills to the means.
 Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.
 token     Text 1  Text 2
 end       1       0
 go        0       2
 he        1       0
 if        0       1
 means     1       0
 Moses     0       2
 mountain  0       2
 must      0       1
 not       1       1
 that      1       0
 the       2       2
 then      0       1
 to        2       2
 will      2       1
 Comparing text vectors, e.g., by cosine similarity: Similarity(T1, T2) := cos(T1, T2)
 [Figure: Text1 and Text2 drawn as vectors in a space with axes count(Word1), count(Word2), and count(Word3), with the angles α, β, γ marked.]
 FOSS libraries: INDRI
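As a worked example of the cosine similarity above, the sketch below uses the token counts exactly as they appear in the slide's table (which lemmatizes “wills” to “will” and leaves out “neither”). The names `counts` and `cosine_similarity` are illustrative only.

```python
import math

# Token counts copied from the slide's table: token -> (Text 1, Text 2).
counts = {
    "end": (1, 0), "go": (0, 2), "he": (1, 0), "if": (0, 1),
    "means": (1, 0), "Moses": (0, 2), "mountain": (0, 2), "must": (0, 1),
    "not": (1, 1), "that": (1, 0), "the": (2, 2), "then": (0, 1),
    "to": (2, 2), "will": (2, 1),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Each token is one dimension; each text becomes one vector.
text1 = [c[0] for c in counts.values()]
text2 = [c[1] for c in counts.values()]

similarity = cosine_similarity(text1, text2)
print(round(similarity, 3))
```

Only the shared tokens (“not”, “the”, “to”, “will”) contribute to the dot product; every other dimension is zero in one of the two texts.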
  • 11. Inverted indices and the central dogma of machine learning
 $y = h_\theta(X)$, drawn on the slide as the matrix product $X^T \times \theta = y$.
 y: Rank, Class, Expectation, Probability, Descriptor*, …
 X^T: the inverted index (transposed), with the “texts” (n; instances, observations) as rows and the n-grams (p; variables, features) as columns.
 θ: the parameters, one per feature; “nonparametric” models have them per instance.
 (Hyperparameters are settings that control the learning algorithm.)
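The relation y = h_θ(X) can be illustrated with the simplest possible hypothesis, a plain linear model that multiplies the transposed inverted index by one parameter per feature. Everything in this sketch is assumed for illustration: the three features, their counts (borrowed from the two example texts on the word-vector slide), and the θ values, which are made up rather than learned.

```python
def linear_hypothesis(x_t, theta):
    """Compute y = X^T * theta: one output value per text (row)."""
    return [sum(count * weight for count, weight in zip(row, theta))
            for row in x_t]


features = ["the", "to", "will"]   # p = 3 n-gram features (columns)
x_t = [
    [2, 2, 2],                     # Text 1 counts for the three features
    [2, 2, 1],                     # Text 2 counts for the three features
]                                  # n = 2 texts (rows)
theta = [0.5, -0.25, 1.0]          # hypothetical parameters, one per feature

y = linear_hypothesis(x_t, theta)
print(y)
```

A learning algorithm would fit θ from labeled examples; here the values are fixed by hand just to show the shapes involved (n rows, p columns, p parameters, n outputs).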
  • 12. The curse of dimensionality (R.E. Bellman, 1961) [inventor of dynamic programming]
 • p ≫ n (far more tokens/features than texts/instances)
 • Inverted indices (X) are (discrete) sparse matrices.
 • Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.
 ‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, outside) than to other instances.
 ✓ Remedy: (feature) dimensionality reduction. The “blessing of non-uniformity.”
 • feature extraction (compression): PCA/LSA (projection), factor analysis (regression), compression, auto-encoders & deep learning (compression & embedding), …
 • feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
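The hypercube claim can be checked with a small simulation. This is a sketch under simplifying assumptions: points are drawn uniformly at random from the unit hypercube (real text vectors are sparse counts, not uniform), and the function name, sample size, and seed are arbitrary choices.

```python
import math
import random


def fraction_closer_to_face(n_points, n_dims, seed=0):
    """Share of points whose nearest cube face is closer than any other point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    closer = 0
    for i, p in enumerate(points):
        # Distance to the nearest face of the unit hypercube.
        face_dist = min(min(x, 1.0 - x) for x in p)
        # Euclidean distance to the nearest other point.
        neighbor_dist = min(math.dist(p, q)
                            for j, q in enumerate(points) if j != i)
        if face_dist < neighbor_dist:
            closer += 1
    return closer / n_points


low = fraction_closer_to_face(50, 1)
high = fraction_closer_to_face(50, 100)

for dims in (1, 2, 10, 100):
    print(dims, fraction_closer_to_face(50, dims))
```

With one dimension only a few points near the ends of the interval are closer to the boundary than to a neighbor; with a hundred dimensions every point is.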
  • 14. Google’s review summaries: Opinion mining (“sentiment” analysis).
 Don’t do it, please… ;-) (If you must: see document and text classification software.)
  • 15. Polarity of sentiment keywords in IMDB. Example: “not good”. Source: Christopher Potts. On the negativity of negation. 2011
  • 16. Language understanding: Parsing and semantic analysis.
 Example: Apple Siri
 • Named Entity Recognition (disambiguation!)
 • Entity Grounding (disambiguation!)
 • Coreference (Anaphora) Resolution (disambiguation!)
 Parsing: N. Chomsky, L. Tesnière
 FOSS libraries: Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
  • 17. Automatic text summarization…
 …is very difficult because…
 • Variance/human agreement: When is a summary “correct”?
 • Coherence: providing discourse structure (text flow) to the summary.
 • Paraphrasing: important sentences are repeated, but with different wordings.
 • Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)
 • Anaphora (coreference) resolution: very hard, but crucial.
 Image Source: www.lexalytics.com
 FOSS libraries: Lex[Page]Rank (JUNG), sumy, TextTeaser (the author got hired by Google…)
  • 18. Machine translation: Deep learning with auto-encoders.
 Different languages…
 ‣ have only one gender (en) or use opposing genders (es vs. de: el/die; la/der; …/das).
 ‣ have different verb placements (es⬌de).
 ‣ have different concepts of verbs (Latin, Arabic, CJK).
 ‣ use different tenses (en⬌de).
 ‣ have different word orders (Latin, Arabic, CJK).
 FOSS libraries: DL4J
  • 19. Question answering: The champions league of TM & NLP.
 Examples: IBM Watson, WolframAlpha
 Category: Oscar Winning Movies
 Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
 Answer: The Silence of the Lambs
 Biggest issue: statistical inference (cognitive computing). All men are mortal. Socrates probably is a man… Therefore, Socrates might be mortal.
  • 20. Information extraction: Knowledge mining for molecular biology.
 [Figure: information extraction diagram. Labels: WWW, Articles, Article Classification, Named Entity Recognition, Entity Mapping (Grounding), Entity Associations, Relationship Extraction, Relationship Annotations, Binary Interactions, Experimental Methods, Biological Model, Short Factoid Question Answering, Ontologies & Thesauri, Biological Repositories. Grounding example: Cdk5, Rat, TaxID 10116, UniProt Q03114.]
 FOSS libraries: MITIE, OpenDMAP, ClearTK
  • 21. Text mining and language processing is all about resolving ambiguities.
 • Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
 • Part-of-Speech tagging: The robot wheels out the iron.
 • Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
 • Entity recognition & grounding: Is Princeton really good for you?