Intro to nlp

Introduction to Natural Language Processing

Rutu Mulkar-Mehta, PhD

Founder and Data Scientist @Ticary

@RutuMulkar

Co-hosted Meetup

Data Science Dojo
http://www.meetup.com/data-science-dojo

Natural Language Processing
http://www.meetup.com/Natural-Language-Processing-Meetup/

About Me

• Founder and Data Scientist at Ticary
• Background:
  – PhD in Natural Language Processing
  – Computer Science
• Worked on applying NLP to:
  – Healthcare
  – SEO (Search Engine Optimization)
  – Other Stuff: Sentiment Analysis, Question Answering, Natural Language Understanding ++

Agenda

• Understanding Natural Language
• Introduction to different NLP Problems
• Part of Speech tagging
• Linguistic Resources

UNDERSTANDING NATURAL LANGUAGE

Some Example Sentences

• Children make delicious snacks
• I saw the Grand Canyon flying to New York
• Stolen painting found by the tree
• Two sentences:
  – Monkeys like bananas when they wake up.
  – Monkeys like bananas when they are ripe.

Why is NLP Hard?

Brazil crowds attend funeral of late candidate Campos

More than 100,000 people in Brazil have paid their last respects to the late presidential candidate, Eduardo Campos, who died in a plane crash on Wednesday.

They attended a funeral Mass and filled the streets of the city of Recife to follow the passage of his coffin.

Later this week, Mr. Campos's Socialist Party is expected to appoint former Environment Minister Marina Silva as a replacement candidate.

Mr. Campos's jet crashed in bad weather in Santos, near Sao Paulo.

Investigators are still trying to establish the exact causes of the crash, which killed six other people.

Why is NLP Hard?

• To understand the current event, you need to understand several other concepts:
  – Current Event
  – Background Event
  – Property
  – References to other events
  – Pronouns

NLP TASKS

What can we solve with Natural Language Processing?

NLP Tasks

• Text Categorization
• Sentiment Analysis
• Information Extraction
• Information Retrieval
• Question Answering
• Text Summarization
• Machine Translation

Text Categorization

Input Document

What is the document about:
  sports: 0.2%
  politics: 2%
  entertainment: 96%
  religion: …
  finance: …

Text Classification

Word clouds: finance.yahoo.com, sports.yahoo.com
(make your own wordle using wordle.net)

Vocabulary used in one genre of text is different from vocabulary used in another genre
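
The idea that each genre has its own vocabulary can be sketched as a tiny bag-of-words Naive Bayes classifier. This is only an illustration under stated assumptions: the four training sentences, the labels, and the function names (`train`, `classify`) are made up for this example and are not from the talk.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(labeled_docs):
    """Count words per label. labeled_docs is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data, two genres.
TRAINING = [
    ("stocks fell as the market closed lower", "finance"),
    ("investors bought shares and bonds", "finance"),
    ("the team won the game in overtime", "sports"),
    ("the striker scored a late goal", "sports"),
]
MODEL = train(TRAINING)

print(classify("the market rallied and shares rose", MODEL))
print(classify("the team scored a goal", MODEL))
```

With real data the same counting approach needs far more documents per genre and a better tokenizer.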

Sentiment Analysis

Product Reviews – Kindle Paperwhite

• Sharp screen resolution
• Low battery life

Sentiment Analysis

• What are people saying?
  – Twitter
  – Reviews
  – Blogs
  – Emails
• Can be for:
  – Products
  – Companies
  – Movies
  – Books

Sentiment Analysis

Possible Features

• Important keywords and key phrases:
  – POS: dazzling, brilliant, phenomenal
  – NEG: hideous, awful, unwatchable
• Emoticons
  – POS :-)
  – NEG :-(
• Ontologies
  – WordNet: https://wordnet.princeton.edu/
  – SentiWordNet: http://sentiwordnet.isti.cnr.it/
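
A minimal sketch of how keyword and emoticon features could be turned into a score, assuming nothing more than the word lists above. The function names and the "count positives minus negatives" rule are assumptions made for illustration, not a method described in the talk.

```python
import re

# Keyword and emoticon lists taken from the feature examples above.
POSITIVE = {"dazzling", "brilliant", "phenomenal", ":-)"}
NEGATIVE = {"hideous", "awful", "unwatchable", ":-("}

# Match the two emoticons first, then plain words.
TOKEN_RE = re.compile(r":-\)|:-\(|[a-z']+")


def sentiment_score(text):
    """Number of positive cues minus number of negative cues."""
    tokens = TOKEN_RE.findall(text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def sentiment_label(text):
    """Map the score to POS, NEG, or NEUTRAL."""
    score = sentiment_score(text)
    if score > 0:
        return "POS"
    if score < 0:
        return "NEG"
    return "NEUTRAL"


print(sentiment_label("A brilliant, dazzling film :-)"))
print(sentiment_label("Awful and unwatchable :-("))
```

A counter like this ignores negation, sarcasm, and mixed opinions, so it is a baseline at best.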

Challenges

• People express opinions in complex ways
  – “The acting was great and the plots were intense and mesmerizing, but I hated the movie”
• Sarcasm, humor and other expressions
  – “It was a great movie for a Sunday nap. I only fell asleep twice, but it was very restful”

Information Extraction

Input Document

What are the key pieces of information?
  Location:
  Time:
  People:
  …

Extracting Named Entities from Documents

Other ways for IE: Hypernyms (type of)

colors such as red, blue and …
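
The "X such as A, B and C" phrasing can be matched with a regular expression to pull out type-of pairs. The sketch below is one possible way to do it; the pattern, the function name `extract_hyponyms`, and the example sentence (including the word "green") are assumptions for illustration.

```python
import re

# A list item is any word except the connectives "and" / "or".
ITEM = r"(?!(?:and|or)\b)\w+"

# "<hypernym> such as <item>, <item> and <item>"
SUCH_AS = re.compile(
    rf"(\w+) such as ({ITEM}(?:, {ITEM})*(?:,? (?:and|or) {ITEM})?)"
)


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found via the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the list on ", and", " and", ", or", " or", or plain commas.
        items = re.split(r",?\s+(?:and|or)\s+|,\s*", match.group(2))
        pairs.extend((item, hypernym) for item in items if item)
    return pairs


# Hypothetical example sentence.
print(extract_hyponyms("She likes colors such as red, blue and green."))
```

Only single-word items are handled here; multi-word noun phrases would need a chunker or parser.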

Other ways for IE: Synonyms

Find different relations between 2 concepts:

Microsoft bought Farecast

Information Retrieval

[Figure: a query matched against a collection of input documents]

What are the documents relevant to the query?

Information Retrieval

Q) Which documents are most relevant to a given query?

A) Similar vocabulary between query and document?

Quantify similarity based on maximum overlap
  – Cosine Similarity
  – Jaccard Similarity
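
Both measures can be written in a few lines. The sketch below assumes plain whitespace tokenization and raw term counts (no TF-IDF weighting); the function names and example strings are illustrative only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Shared vocabulary divided by combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two term-count vectors."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Order documents from most to least similar to the query."""
    return sorted(documents, key=lambda d: cosine_similarity(query, d), reverse=True)


docs = ["cooking recipes", "natural language processing"]
print(rank("language processing", docs))
print(jaccard_similarity("natural language processing", "language processing tools"))
```

Jaccard only looks at which words are shared, while cosine also takes repeated words into account.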

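Information Retrieval

Q) If you rewrite the query, will that give you better results?

A) Often, yes. It is called “Query Expansion”: related terms such as synonyms are added to the query, which mainly helps find relevant documents that use different wording.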
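
A minimal sketch of query expansion, assuming a hand-written synonym table. The table contents and the function names are hypothetical; a real system might draw synonyms from a resource such as WordNet instead.

```python
# Hypothetical synonym table used only for this illustration.
SYNONYMS = {
    "car": ["automobile"],
    "buy": ["purchase"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms followed by any known synonyms of each term."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return expanded


def overlap(terms, document):
    """Number of distinct query terms that appear in the document."""
    return len(set(terms) & set(document.lower().split()))


document = "where to purchase an automobile"
print(overlap("buy car".split(), document))
print(overlap(expand_query("buy car"), document))
```

The original query shares no words with the document, while the expanded query matches it on both added synonyms.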

Commercial Search Tools

• Lucene
  – http://lucene.apache.org/
• ElasticSearch
  – https://www.elastic.co/

Underlying technology in most of these is the same, with some variations

Meetup about this topic scheduled for early 2016

Question Answering - Closed

Input Data Source

Questions:
  What event happened?
  When did the event happen?
  Why did the event happen?
  How long was the event?
  How did the event happen?

Question Answering - Open

Types of Text Summarization

• Keyword Summaries
  – Extract significant keywords from text
  – Easy to implement
  – Hard for end user to understand
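
One simple way to pick "significant" keywords is to count words after dropping common function words. This is a sketch under that assumption; the stopword list, function name, and sample text are made up for illustration.

```python
import re
from collections import Counter

# Small hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "was", "for", "on", "by"}


def keyword_summary(text, k=3):
    """Return the k most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


text = (
    "The crash investigators studied the crash site. "
    "The crash happened in bad weather, and investigators are still working."
)
print(keyword_summary(text, k=2))
```

The output is a bare list of words, which shows why this kind of summary is easy to build but hard for a reader to interpret.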

Types of Text Summarization

• Sentence/Phrase Extraction
  – Extract relevant sentences
  – Medium-Hard to implement
  – Easy for end user to understand
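
Sentence extraction can be sketched by scoring each sentence with the average frequency of its words and keeping the best ones. The scoring rule, stopword list, and sample text below are assumptions chosen for illustration; other scoring schemes are equally plausible.

```python
import re
from collections import Counter

# Small hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "was"}


def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased words of the sentence without stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def extractive_summary(text, n=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in top]


text = (
    "The jet crash killed seven people. "
    "Investigators are studying the jet crash. "
    "The weather was bad."
)
print(extractive_summary(text, n=1))
```

Because the output consists of whole sentences from the source, it stays readable even when the selection is imperfect.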

Types of Text Summarization

• Natural Language Understanding and Generation
  – Understand meaning of text
  – Generate sentences from meaning of original text
  – Hard to implement
  – Easy for end user to understand

President of University of Missouri resigned after graduate student hunger strike and class cancellations by faculty

Machine Translation

translate.google.com

Why is MT Hard?

• It is not a 1 to 1 translation
  – In the previous example 4 words in English translate into 2 in Spanish
• Grammar is different in different languages
  – SOV (Subject – Object – Verb)
    • “She him loves” (Hindi, Japanese)
  – SVO (Subject – Verb – Object)
    • “She loves him” (English, Mandarin)

Machine Translation

• Waygo app
• Instantly translates Chinese, Japanese and Korean
• Simply point and translate
• Offline

http://waygoapp.com/

LINGUISTIC NUANCES

Back to the basics

Example

All the gobulins were gramzies.
It was grimbleton.

What are the underlined words?

• gobulins: Noun
• gramzies: Noun or Adjective
• grimbleton: Noun or Adjective

Why is the example important?

We can get a sense of what the word means, based on how it is used in language.

Nouns

• E.g. cat, car, computer, tree
• Variations:
  – Number: singular, plural
    • one car, two cars
  – Gender: masculine, feminine, neuter
  – Case: nominative, genitive, accusative, dative

Pronouns

• E.g. she, ourselves, mine
• Vary in
  – Person
  – Gender
    • his, her
  – Number
  – Case: nominative, accusative, possessive, 2nd possessive
  – Reflexive and Anaphoric Forms:
    • herself, each other

Determiners

• Articles
  – a, an, the
• Demonstratives
  – this, that

Adjectives

• Describe Properties
  – sunny, beautiful, calm
• Attributive and predicative properties
• Agreement
  – in gender, number
• Comparative and superlative forms
  – derivative and periphrastic
    • positive form

Verbs

• Tense: past, present, future
  – danced, dancing, will dance
• Aspect: progressive, perfective
• Voice: active, passive
• Other: number, person
• Arguments: transitive, intransitive, ditransitive

Other POS tags

• Adverbs
  – happily
• Prepositions
  – of, on, in
• Particles
  – ran a bill vs ran up a bill

Morphological Analysis

• Sleeps = sleep + V + 3rd Person + Singular
• If we have a good enough grammar with all of these rules, we have a good shot at understanding syntax of language
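
The decomposition of "sleeps" can be sketched with a few suffix rules over a small verb lexicon. The lexicon, the rule set, and the feature labels other than those in the example above are assumptions for illustration; irregular forms such as "slept" are not handled.

```python
# Hypothetical toy lexicon of verb base forms.
VERB_LEMMAS = {"sleep", "walk", "dance"}


def analyze(word):
    """Return [lemma, 'V', features...] for a known verb form, else None."""
    word = word.lower()
    if word in VERB_LEMMAS:
        return [word, "V", "Base"]
    # Third person singular: sleep + s
    if word.endswith("s") and word[:-1] in VERB_LEMMAS:
        return [word[:-1], "V", "3rd Person", "Singular"]
    # -ing and -ed forms, also trying to restore a dropped final "e".
    for suffix, feature in (("ing", "Progressive"), ("ed", "Past")):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for candidate in (stem, stem + "e"):
                if candidate in VERB_LEMMAS:
                    return [candidate, "V", feature]
    return None


print(" + ".join(analyze("Sleeps")))
print(analyze("danced"))
```

Each new rule extends coverage a little, which is the sense in which a "good enough grammar" builds up piece by piece.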

Automatic Taggers

• Almost all the POS taggers use the Penn-Treebank list of tags
• https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
  – Nouns:
    • NN (house), NNS (houses), NNP (White House), NNPS
  – Verbs:
    • VB (say), VBD (said), VBG (saying), VBN, VBP, VBZ
  – Adjectives:
    • JJ (good), JJR (better), JJS (best)
  – Adverbs: RB, RBR, RBS
  – Prepositions: IN
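
Because the tags in each family share a prefix, a tagger's output can be collapsed into coarse word classes with a few string checks. The helper name `coarse_pos` and the hand-tagged example tokens are hypothetical; only the tags listed above are covered.

```python
def coarse_pos(tag):
    """Map a Penn Treebank tag to a coarse word class."""
    if tag.startswith("NN"):
        return "noun"        # NN, NNS, NNP, NNPS
    if tag.startswith("VB"):
        return "verb"        # VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "adjective"   # JJ, JJR, JJS
    if tag.startswith("RB"):
        return "adverb"      # RB, RBR, RBS
    if tag == "IN":
        return "preposition"
    return "other"


# Hand-tagged tokens for illustration (not produced by a real tagger).
tagged = [("houses", "NNS"), ("said", "VBD"), ("better", "JJR"), ("in", "IN")]
print([(word, coarse_pos(tag)) for word, tag in tagged])
```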

POS Tagging and Parsing

• Stanford CoreNLP
  – http://nlp.stanford.edu:8080/corenlp/
• NLTK
  – Natural Language Toolkit
  – You need to provide your own training data, and train models for NLTK to be effective

Other Linguistic Features of Interest

– We want to get nouns and verbs into a root form
  E.g.
  • am, are, is → be
  • car, cars, car’s → car
– Two approaches:
  • Stemming
  • Lemmatization

Stemming and Lemmatization

• Lemmatization
  – use of a vocabulary
  – morphological analysis of words
  – returns the base or dictionary form of a word
  – base form is known as the lemma
  – e.g. am, are, is → be
• Stemming
  – crude heuristic process
  – chops off the ends of words in the hope of achieving this goal (reducing words to a root form)
  – e.g. Marked → Mark, Marker → Mark
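
The contrast between the two approaches can be sketched as a suffix chopper versus a vocabulary lookup. Both functions are toy versions written for this illustration: the suffix list, the minimum stem length, and the lemma table are assumptions, not the rules of any real stemmer or lemmatizer.

```python
# Toy vocabulary for lemmatization, built from the examples above.
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "car's": "car"}

# Suffixes the toy stemmer chops off, tried in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def stem(word):
    """Chop a known suffix off the end, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ("Marked", "Marker", "is", "cars"):
    print(w, stem(w), lemmatize(w))
```

The stemmer handles "Marked" and "Marker" without any vocabulary but cannot map "is" to "be"; the lemmatizer does the reverse, which is the trade-off between the two.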

Parsing Resources

• NLTK
  – Python, low accuracy, fast
  – http://www.nltk.org/
• Stanford CoreNLP
  – Java, high accuracy, slow
  – http://nlp.stanford.edu/software/corenlp.shtml
• spaCy
  – Python, medium accuracy, fast
  – https://spacy.io/

Other Resources: Ontologies

• WordNet
  – groups words when they have the same meaning
  – represents hierarchical links between groups
  – E.g. car is the same thing as an automobile
• SentiWordNet
  – WordNet + Sentiment
• ConceptNet
  – broader relationships than WordNet
  – E.g. bread is typically found near a toaster.
• FrameNet
  – Frames represent concepts and their associated roles

SOMETHING TO THINK ABOUT

Semantics and Word Collocations

• It is important to know which words occur together
  – Strong Beer vs Powerful Beer
  – Big Sister vs Large Sister
• Two approaches have been used
  – Semantics – ontologies and word meanings
  – Statistics – word collocations and probabilities
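
The statistical approach can be sketched with pointwise mutual information (PMI), which compares how often two words appear next to each other with how often they would if they were independent. PMI is one common choice, named here as an assumption rather than the method used in the talk, and the token sequence is a made-up toy corpus.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """PMI (log base 2) for every adjacent word pair seen in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Hypothetical toy corpus.
tokens = "strong beer and strong tea but powerful engine and powerful car".split()
scores = pmi_scores(tokens)
print(scores[("strong", "beer")] > scores[("and", "strong")])
print(("powerful", "beer") in scores)
```

On a corpus this small the numbers mean little; the point is only that "strong beer" is observed while "powerful beer" never is.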

Thank you for Listening

rutu@ticary.com

@RutuMulkar
