SlideShare a Scribd company logo
1 of 37
Natural Language Processing
+ Python
by Ann C. Tan-Pohlmann

February 22, 2014
Outline
• NLP Basics
• NLTK
– Text Processing

• Gensim (really, really short )
– Text Classification

2
Natural Language Processing
• computer science, artificial intelligence, and
linguistics
• human–computer interaction
• natural language understanding
• natural language generation
- Wikipedia

3
Star Trek's Universal Translator

http://www.youtube.com/watch?v=EaeSKU
V2zp0
Spoken Dialog Systems

5
NLP Basics
• Morphology
– study of word formation
– how word forms vary in a sentence

• Syntax
– branch of grammar
– how words are arranged in a sentence to show
connections of meaning

• Semantics
– study of meaning of words, phrases and sentences
6
NLTK: Getting Started
• Natural Language Took Kit
– for symbolic and statistical NLP
– teaching tool, study tool and as a platform for prototyping

• Python 2.7 is a prerequisite
>>> import nltk
>>> nltk.download()

7
Some NLTK methods
•
•
•
•
•

Frequency Distribution

text.similar(str)
concordance(str)
len(text)
len(set(text))
lexical_diversity

•
•
•
•
•

– len(text)/
len(set(text))

fd = FreqDist(text)
fd.inc(str)
fd[str]
fd.N()
fd.max()

• text.collocations()
- sequence of words that occur
together often

MORPHOLOGY > Syntax > Semantics

8
Frequency Distribution
•
•
•
•
•

fd = FreqDist(text)
fd.inc(str) – increment count
fd[str] – returns the number of occurrence for sample str
fd.N() – total number of samples
fd.max() – sample with the greatest count

9
Corpus
• large collection of raw or categorized text on
one or more domain
• Examples: Gutenberg, Brown, Reuters, Web &
Chat Txt
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', '
humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> adventure_text = brown.words(categories='adventure')

10
Corpora in Other Languages
>>> from nltk.corpus import udhr
>>> languages = nltk.corpus.udhr.fileids()
>>> languages.index('Filipino_Tagalog-Latin1')
>>> tagalog = nltk.corpus.udhr.raw('Filipino_Tagalog-Latin1')
>>> tagalog_words = nltk.corpus.udhr.words('Filipino_Tagalog-Latin1')
>>> tagalog_tokens = nltk.word_tokenize(tagalog)
>>> tagalog_text = nltk.Text(tagalog_tokens)
>>> fd = FreqDist(tagalog_text)
>>> for sample in fd:
... print sample

11
Using Corpus from Palito
Corpus
– large collection of raw or categorized text
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_dir = '/Users/ann/Downloads'
>>> tagalog = PlaintextCorpusReader(corpus_dir,
'Tagalog_Literary_Text.txt')
>>> raw = tagalog.raw()
>>> sentences = tagalog.sents()
>>> words = tagalog.words()
>>> tokens = nltk.word_tokenize(raw)
>>> tagalog_text = nltk.Text(tokens)
12
Spoken Dialog Systems

MORPHOLOGY > Syntax > Semantics

13
Tokenization
Tokenization
– breaking up of string into words and punctuations

>>> tokens = nltk.word_tokenize(raw)
>>> tagalog_tokens = nltk.Text(tokens)
>>> tagalog_tokens = set(sample.lower() for sample in tagalog_tokens)

MORPHOLOGY > Syntax > Semantics

14
Stemming
Stemming
– normalize words into its base form, result may not be the 'root' word
>>> def stem(word):
... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...
if word.endswith(suffix):
...
return word[:-len(suffix)]
... return word
...
>>> stem('reading')
'read'
>>> stem('moment')
'mo'

MORPHOLOGY > Syntax > Semantics

15
Lemmatization
Lemmatization
– uses vocabulary list and morphological analysis (uses POS of a word)
>>> def stem(word):
... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...
if word.endswith(suffix) and word[:-len(suffix)] in brown.words():
...
return word[:-len(suffix)]
... return word
...
>>> stem('reading')
'read'
>>> stem('moment')
'moment'

MORPHOLOGY > Syntax > Semantics

16
NLTK Stemmers & Lemmatizer
• Porter Stemmer and Lancaster Stemmer
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(w) for w in brown.words()[:100]]

• Word Net Lemmatizer
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(w) for w in brown.words()[:100]]

• Comparison
>>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
>>> [porter.stem(w) for w in ['investigation', 'women']]
>>> [lancaster.stem(w) for w in ['investigation', 'women']]

MORPHOLOGY > Syntax > Semantics

17
Using Regular Expression
Operator
.
^abc
abc$
[abc]
[A-Z0-9]
ed|ing|s
*
+
?
{n}
{n,}
{,n}
{m,n}
a(b|c)+

Behavior
Wildcard, matches any character
Matches some pattern abc at the start of a string
Matches some pattern abc at the end of a string
Matches one of a set of characters
Matches one of a range of characters
Matches one of the specified strings (disjunction)
Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
One or more of previous item, e.g. a+, [a-z]+
Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
Exactly n repeats where n is a non-negative integer
At least n repeats
No more than n repeats
At least m and no more than n repeats
Parentheses that indicate the scope of the operators

MORPHOLOGY > Syntax > Semantics

18
Using Regular Expression
>>> import re
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'reading')
[('read', 'ing')]
>>> def stem(word):
... regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
... stem, suffix = re.findall(regexp, word)[0]
... return stem
...
>>> stem('reading')
'read'
>>> stem('moment')
'moment'

MORPHOLOGY > Syntax > Semantics

19
Spoken Dialog Systems

Morphology > SYNTAX > Semantics

20
Lexical Resources
• collection of words with association information (annotation)
• Ex: stopwords – high-frequency words with little lexical
content
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
>>> stopwords.words('german')

MORPHOLOGY > Syntax > Semantics

21
Part-of-Speech (POS) Tagging
• the process of labeling and classifying words
to a particular part of speech based on its
definition and context

Morphology > SYNTAX > Semantics

22
NLTKs POS Tag Sets* – 1/2
Tag
ADJ
ADV
CNJ
DET
EX
FW
MOD
N
NP

Meaning
adjective
adverb
conjunction
determiner
existential
foreign word
modal verb
noun
proper noun

Examples
new, good, high, special, big, local
really, already, still, early, now
and, or, but, if, while, although
the, a, some, most, every, no
there, there's
dolce, ersatz, esprit, quo, maitre
will, can, would, may, must, should
year, home, costs, time, education
Alison, Africa, April, Washington

*simplified
Morphology > SYNTAX > Semantics

23
NLTKs POS Tag Sets* – 2/2
Tag
NUM
PRO
P
TO
UH
V
VD
VG
VN
WH

Meaning
number
pronoun
preposition
the word to
interjection
verb
past tense
present participle
past participle
wh determiner

Examples
twenty-four, fourth, 1991, 14:24
he, their, her, its, my, I, us
on, of, at, with, by, into, under
to
ah, bang, ha, whee, hmpf, oops
is, has, get, do, make, see, run
said, took, told, made, asked
making, going, playing, working
given, taken, begun, sung
who, which, when, what, where, how

*simplified
Morphology > SYNTAX > Semantics

24
NLTK POS Tagger (Brown)
>>> nltk.pos_tag(brown.words()[:30])
[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'),
('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'),
('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'JJ'), ('recent', 'JJ'),
('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBN'), ('``', '``'), ('no',
'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'WDT'), ('any', 'DT'),
('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The',
'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN')]
>>> brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]

Morphology > SYNTAX > Semantics

25
NLTK POS Tagger (German)
>>> german = nltk.corpus.europarl_raw.german
>>> nltk.pos_tag(german.words()[:30])
[(u'Wiederaufnahme', 'NNP'), (u'der', 'NN'), (u'Sitzungsperiode', 'NNP'),
(u'Ich', 'NNP'), (u'erklxe4re', 'NNP'), (u'die', 'VB'), (u'am', 'NN'), (u'Freita
g', 'NNP'), (u',', ','), (u'dem', 'NN'), (u'17.', 'CD'), (u'Dezember', 'NNP'), (u'
unterbrochene', 'NN'), (u'Sitzungsperiode', 'NNP'), (u'des', 'VBZ'), (u'Eur
opxe4ischen', 'JJ'), (u'Parlaments', 'NNS'), (u'fxfcr', 'JJ'), (u'wiederaufg
enommen', 'NNS'), (u',', ','), (u'wxfcnsche', 'NNP'), (u'Ihnen', 'NNP'), (u'
nochmals', 'NNS'), (u'alles', 'VBZ'), (u'Gute', 'NNP'), (u'zum', 'NN'), (u'Ja
hreswechsel', 'NNP'), (u'und', 'NN'), (u'hoffe', 'NN'), (u',', ',')]

xe4 = ä xfc = ü
!!! DOES NOT WORK FOR GERMAN

Morphology > SYNTAX > Semantics

26
NLTK POS Dictionary
>>> pos = nltk.defaultdict(lambda:'N')
>>> pos['eat']
'N'
>>> pos.items()
[('eat', 'N')]
>>> for (word, tag) in brown.tagged_words(simplify_tags=True):
... if word in pos:
...
if isinstance(pos[word], str):
...
new_list = [pos[word]]
...
pos[word] = new_list
...
if tag not in pos[word]:
...
pos[word].append(tag)
... else:
...
pos[word] = [tag]
...
>>> pos['eat']
['N', 'V']
Morphology > SYNTAX > Semantics

27
What else can you do with NLTK?
• Other Taggers
– Unigram Tagging
• nltk.UnigramTagger()
• train tagger using tagged sentence data

– N-gram Tagging

• Text classification using machine learning
techniques
– decision trees
– naïve Bayes classification (supervised)
– Markov Models
Morphology > SYNTAX > SEMANTICS

28
Gensim
• Tool that extracts semantic structure of
documents, by examining word statistical cooccurrence patterns within a corpus of
training documents.
• Algorithms:
1. Latent Semantic Analysis (LSA)
2. Latent Dirichlet Allocation (LDA) or Random
Projections
Morphology > Syntax > SEMANTICS

29
Gensim
• Features
– memory independent
– wrappers/converters for several data formats

• Vector
– representation of the document as an array of features or
question-answer pair
1.
2.
3.

(word occurrence, count)
(paragraph, count)
(font, count)

• Model
– transformation from one vector to another
– learned from a training corpus without supervision
Morphology > Syntax > SEMANTICS

30
Wiki document classification

http://radimrehurek.com/gensim/wiki.html

31
Other NLP tools for Python
• TextBlob
– part-of-speech tagging, noun phrase extraction,
sentiment analysis, classification, translation
– https://pypi.python.org/pypi/textblob

• Pattern
– part-of-speech taggers, n-gram search, sentiment
analysis, WordNet, machine learning
– http://www.clips.ua.ac.be/pattern
32
Star Trek technology that became a reality

http://www.youtube.com/watch?v=sRZxwR
IH9RI
Installation Guides
• NLTK
– http://www.nltk.org/install.html
– http://www.nltk.org/data.html

• Gensim
– http://radimrehurek.com/gensim/install.html

• Palito
– http://ccs.dlsu.edu.ph:8086/Palito/find_project.js
p
34
Using iPython
• http://ipython.org/install.html
>>> documents = ["Human machine interface for lab abc computer applications",
>>>
"A survey of user opinion of computer system response time",
>>>
"The EPS user interface management system",
>>>
"System and human system engineering testing of EPS",
>>>
"Relation of user perceived response time to error measurement",
>>>
"The generation of random binary unordered trees",
>>>
"The intersection graph of paths in trees",
>>>
"Graph minors IV Widths of trees and well quasi ordering",
>>>
"Graph minors A survey"]

35
References
• Natural Language Processing with Python By
Steven Bird, Ewan Klein, Edward Loper
• http://www.nltk.org/book/
• http://radimrehurek.com/gensim/tutorial.htm
l

36
Thank You!
• For questions and comments:
- ann at auberonsolutions dot com

37

More Related Content

What's hot

Programming in Computational Biology
Programming in Computational BiologyProgramming in Computational Biology
Programming in Computational Biology
AtreyiB
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Vivian S. Zhang
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easy
Gopi Krishnan Nambiar
 

What's hot (20)

Python and sysadmin I
Python and sysadmin IPython and sysadmin I
Python and sysadmin I
 
Programming in Computational Biology
Programming in Computational BiologyProgramming in Computational Biology
Programming in Computational Biology
 
Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analytics
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Large scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azureLarge scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azure
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Python for Penetration testers
Python for Penetration testersPython for Penetration testers
Python for Penetration testers
 
Python build your security tools.pdf
Python build your security tools.pdfPython build your security tools.pdf
Python build your security tools.pdf
 
Migrating to Puppet 4.0
Migrating to Puppet 4.0Migrating to Puppet 4.0
Migrating to Puppet 4.0
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
IO Streams, Files and Directories
IO Streams, Files and DirectoriesIO Streams, Files and Directories
IO Streams, Files and Directories
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard Way
 
Biopython
BiopythonBiopython
Biopython
 
An introduction to Python for absolute beginners
An introduction to Python for absolute beginnersAn introduction to Python for absolute beginners
An introduction to Python for absolute beginners
 
The Ring programming language version 1.7 book - Part 43 of 196
The Ring programming language version 1.7 book - Part 43 of 196The Ring programming language version 1.7 book - Part 43 of 196
The Ring programming language version 1.7 book - Part 43 of 196
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
Designing with Groovy Traits - Gr8Conf India
Designing with Groovy Traits - Gr8Conf IndiaDesigning with Groovy Traits - Gr8Conf India
Designing with Groovy Traits - Gr8Conf India
 
Your Own Metric System
Your Own Metric SystemYour Own Metric System
Your Own Metric System
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easy
 
Streams, sockets and filters oh my!
Streams, sockets and filters oh my!Streams, sockets and filters oh my!
Streams, sockets and filters oh my!
 

Similar to Natural Language Processing and Python

Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
Software Guru
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
g3_nittala
 
Declare Your Language (at DLS)
Declare Your Language (at DLS)Declare Your Language (at DLS)
Declare Your Language (at DLS)
Eelco Visser
 

Similar to Natural Language Processing and Python (20)

한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남
 
Term Rewriting
Term RewritingTerm Rewriting
Term Rewriting
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
 
Declare Your Language: Type Checking
Declare Your Language: Type CheckingDeclare Your Language: Type Checking
Declare Your Language: Type Checking
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
 
CS4200 2019 | Lecture 3 | Parsing
CS4200 2019 | Lecture 3 | ParsingCS4200 2019 | Lecture 3 | Parsing
CS4200 2019 | Lecture 3 | Parsing
 
Music as data
Music as dataMusic as data
Music as data
 
Perl 6 in Context
Perl 6 in ContextPerl 6 in Context
Perl 6 in Context
 
Separation of Concerns in Language Definition
Separation of Concerns in Language DefinitionSeparation of Concerns in Language Definition
Separation of Concerns in Language Definition
 
Compiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term RewritingCompiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term Rewriting
 
Ch2
Ch2Ch2
Ch2
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
Open course(programming languages) 20150225
Open course(programming languages) 20150225Open course(programming languages) 20150225
Open course(programming languages) 20150225
 
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)
 
Declare Your Language (at DLS)
Declare Your Language (at DLS)Declare Your Language (at DLS)
Declare Your Language (at DLS)
 
Basics of Python programming (part 2)
Basics of Python programming (part 2)Basics of Python programming (part 2)
Basics of Python programming (part 2)
 
Alastair Butler - 2015 - Round trips with meaning stopovers
Alastair Butler - 2015 - Round trips with meaning stopoversAlastair Butler - 2015 - Round trips with meaning stopovers
Alastair Butler - 2015 - Round trips with meaning stopovers
 
Poetic APIs
Poetic APIsPoetic APIs
Poetic APIs
 
Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014
 
Quepy
QuepyQuepy
Quepy
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Natural Language Processing and Python

  • 1. Natural Language Processing + Python by Ann C. Tan-Pohlmann February 22, 2014
  • 2. Outline • NLP Basics • NLTK – Text Processing • Gensim (really, really short ) – Text Classification 2
  • 3. Natural Language Processing • computer science, artificial intelligence, and linguistics • human–computer interaction • natural language understanding • natural language generation - Wikipedia 3
  • 4. Star Trek's Universal Translator http://www.youtube.com/watch?v=EaeSKU V2zp0
  • 6. NLP Basics • Morphology – study of word formation – how word forms vary in a sentence • Syntax – branch of grammar – how words are arranged in a sentence to show connections of meaning • Semantics – study of meaning of words, phrases and sentences 6
  • 7. NLTK: Getting Started • Natural Language Took Kit – for symbolic and statistical NLP – teaching tool, study tool and as a platform for prototyping • Python 2.7 is a prerequisite >>> import nltk >>> nltk.download() 7
  • 8. Some NLTK methods • • • • • Frequency Distribution text.similar(str) concordance(str) len(text) len(set(text)) lexical_diversity • • • • • – len(text)/ len(set(text)) fd = FreqDist(text) fd.inc(str) fd[str] fd.N() fd.max() • text.collocations() - sequence of words that occur together often MORPHOLOGY > Syntax > Semantics 8
  • 9. Frequency Distribution • • • • • fd = FreqDist(text) fd.inc(str) – increment count fd[str] – returns the number of occurrence for sample str fd.N() – total number of samples fd.max() – sample with the greatest count 9
  • 10. Corpus • large collection of raw or categorized text on one or more domain • Examples: Gutenberg, Brown, Reuters, Web & Chat Txt >>> from nltk.corpus import brown >>> brown.categories() ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', ' humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] >>> adventure_text = brown.words(categories='adventure') 10
  • 11. Corpora in Other Languages >>> from nltk.corpus import udhr >>> languages = nltk.corpus.udhr.fileids() >>> languages.index('Filipino_Tagalog-Latin1') >>> tagalog = nltk.corpus.udhr.raw('Filipino_Tagalog-Latin1') >>> tagalog_words = nltk.corpus.udhr.words('Filipino_Tagalog-Latin1') >>> tagalog_tokens = nltk.word_tokenize(tagalog) >>> tagalog_text = nltk.Text(tagalog_tokens) >>> fd = FreqDist(tagalog_text) >>> for sample in fd: ... print sample 11
  • 12. Using Corpus from Palito Corpus – large collection of raw or categorized text >>> import nltk >>> from nltk.corpus import PlaintextCorpusReader >>> corpus_dir = '/Users/ann/Downloads' >>> tagalog = PlaintextCorpusReader(corpus_dir, 'Tagalog_Literary_Text.txt') >>> raw = tagalog.raw() >>> sentences = tagalog.sents() >>> words = tagalog.words() >>> tokens = nltk.word_tokenize(raw) >>> tagalog_text = nltk.Text(tokens) 12
  • 13. Spoken Dialog Systems MORPHOLOGY > Syntax > Semantics 13
  • 14. Tokenization Tokenization – breaking up of string into words and punctuations >>> tokens = nltk.word_tokenize(raw) >>> tagalog_tokens = nltk.Text(tokens) >>> tagalog_tokens = set(sample.lower() for sample in tagalog_tokens) MORPHOLOGY > Syntax > Semantics 14
  • 15. Stemming Stemming – normalize words into its base form, result may not be the 'root' word >>> def stem(word): ... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']: ... if word.endswith(suffix): ... return word[:-len(suffix)] ... return word ... >>> stem('reading') 'read' >>> stem('moment') 'mo' MORPHOLOGY > Syntax > Semantics 15
  • 16. Lemmatization Lemmatization – uses vocabulary list and morphological analysis (uses POS of a word) >>> def stem(word): ... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']: ... if word.endswith(suffix) and word[:-len(suffix)] in brown.words(): ... return word[:-len(suffix)] ... return word ... >>> stem('reading') 'read' >>> stem('moment') 'moment' MORPHOLOGY > Syntax > Semantics 16
  • 17. NLTK Stemmers & Lemmatizer • Porter Stemmer and Lancaster Stemmer >>> porter = nltk.PorterStemmer() >>> lancaster = nltk.LancasterStemmer() >>> [porter.stem(w) for w in brown.words()[:100]] • Word Net Lemmatizer >>> wnl = nltk.WordNetLemmatizer() >>> [wnl.lemmatize(w) for w in brown.words()[:100]] • Comparison >>> [wnl.lemmatize(w) for w in ['investigation', 'women']] >>> [porter.stem(w) for w in ['investigation', 'women']] >>> [lancaster.stem(w) for w in ['investigation', 'women']] MORPHOLOGY > Syntax > Semantics 17
  • 18. Using Regular Expression Operator . ^abc abc$ [abc] [A-Z0-9] ed|ing|s * + ? {n} {n,} {,n} {m,n} a(b|c)+ Behavior Wildcard, matches any character Matches some pattern abc at the start of a string Matches some pattern abc at the end of a string Matches one of a set of characters Matches one of a range of characters Matches one of the specified strings (disjunction) Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure) One or more of previous item, e.g. a+, [a-z]+ Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]? Exactly n repeats where n is a non-negative integer At least n repeats No more than n repeats At least m and no more than n repeats Parentheses that indicate the scope of the operators MORPHOLOGY > Syntax > Semantics 18
  • 19. Using Regular Expression >>> import re >>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'reading') [('read', 'ing')] >>> def stem(word): ... regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$' ... stem, suffix = re.findall(regexp, word)[0] ... return stem ... >>> stem('reading') 'read' >>> stem('moment') 'moment' MORPHOLOGY > Syntax > Semantics 19
  • 20. Spoken Dialog Systems Morphology > SYNTAX > Semantics 20
  • 21. Lexical Resources • collection of words with association information (annotation) • Ex: stopwords – high-frequency words with little lexical content >>> from nltk.corpus import stopwords >>> stopwords.words('english') >>> stopwords.words('german') MORPHOLOGY > Syntax > Semantics 21
  • 22. Part-of-Speech (POS) Tagging • the process of labeling and classifying words to a particular part of speech based on its definition and context Morphology > SYNTAX > Semantics 22
  • 23. NLTKs POS Tag Sets* – 1/2 Tag ADJ ADV CNJ DET EX FW MOD N NP Meaning adjective adverb conjunction determiner existential foreign word modal verb noun proper noun Examples new, good, high, special, big, local really, already, still, early, now and, or, but, if, while, although the, a, some, most, every, no there, there's dolce, ersatz, esprit, quo, maitre will, can, would, may, must, should year, home, costs, time, education Alison, Africa, April, Washington *simplified Morphology > SYNTAX > Semantics 23
  • 24. NLTKs POS Tag Sets* – 2/2 Tag NUM PRO P TO UH V VD VG VN WH Meaning number pronoun preposition the word to interjection verb past tense present participle past participle wh determiner Examples twenty-four, fourth, 1991, 14:24 he, their, her, its, my, I, us on, of, at, with, by, into, under to ah, bang, ha, whee, hmpf, oops is, has, get, do, make, see, run said, took, told, made, asked making, going, playing, working given, taken, begun, sung who, which, when, what, where, how *simplified Morphology > SYNTAX > Semantics 24
  • 25. NLTK POS Tagger (Brown) >>> nltk.pos_tag(brown.words()[:30]) [('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'), ('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'JJ'), ('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBN'), ('``', '``'), ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'WDT'), ('any', 'DT'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The', 'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN')] >>> brown.tagged_words(simplify_tags=True) [('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...] Morphology > SYNTAX > Semantics 25
  • 26. NLTK POS Tagger (German) >>> german = nltk.corpus.europarl_raw.german >>> nltk.pos_tag(german.words()[:30]) [(u'Wiederaufnahme', 'NNP'), (u'der', 'NN'), (u'Sitzungsperiode', 'NNP'), (u'Ich', 'NNP'), (u'erklxe4re', 'NNP'), (u'die', 'VB'), (u'am', 'NN'), (u'Freita g', 'NNP'), (u',', ','), (u'dem', 'NN'), (u'17.', 'CD'), (u'Dezember', 'NNP'), (u' unterbrochene', 'NN'), (u'Sitzungsperiode', 'NNP'), (u'des', 'VBZ'), (u'Eur opxe4ischen', 'JJ'), (u'Parlaments', 'NNS'), (u'fxfcr', 'JJ'), (u'wiederaufg enommen', 'NNS'), (u',', ','), (u'wxfcnsche', 'NNP'), (u'Ihnen', 'NNP'), (u' nochmals', 'NNS'), (u'alles', 'VBZ'), (u'Gute', 'NNP'), (u'zum', 'NN'), (u'Ja hreswechsel', 'NNP'), (u'und', 'NN'), (u'hoffe', 'NN'), (u',', ',')] xe4 = ä xfc = ü !!! DOES NOT WORK FOR GERMAN Morphology > SYNTAX > Semantics 26
  • 27. NLTK POS Dictionary >>> pos = nltk.defaultdict(lambda:'N') >>> pos['eat'] 'N' >>> pos.items() [('eat', 'N')] >>> for (word, tag) in brown.tagged_words(simplify_tags=True): ... if word in pos: ... if isinstance(pos[word], str): ... new_list = [pos[word]] ... pos[word] = new_list ... if tag not in pos[word]: ... pos[word].append(tag) ... else: ... pos[word] = [tag] ... >>> pos['eat'] ['N', 'V'] Morphology > SYNTAX > Semantics 27
  • 28. What else can you do with NLTK? • Other Taggers – Unigram Tagging • nltk.UnigramTagger() • train tagger using tagged sentence data – N-gram Tagging • Text classification using machine learning techniques – decision trees – naïve Bayes classification (supervised) – Markov Models Morphology > SYNTAX > SEMANTICS 28
  • 29. Gensim • Tool that extracts semantic structure of documents, by examining word statistical cooccurrence patterns within a corpus of training documents. • Algorithms: 1. Latent Semantic Analysis (LSA) 2. Latent Dirichlet Allocation (LDA) or Random Projections Morphology > Syntax > SEMANTICS 29
  • 30. Gensim • Features – memory independent – wrappers/converters for several data formats • Vector – representation of the document as an array of features or question-answer pair 1. 2. 3. (word occurrence, count) (paragraph, count) (font, count) • Model – transformation from one vector to another – learned from a training corpus without supervision Morphology > Syntax > SEMANTICS 30
  • 32. Other NLP tools for Python • TextBlob – part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation – https://pypi.python.org/pypi/textblob • Pattern – part-of-speech taggers, n-gram search, sentiment analysis, WordNet, machine learning – http://www.clips.ua.ac.be/pattern 32
  • 33. Star Trek technology that became a reality http://www.youtube.com/watch?v=sRZxwR IH9RI
  • 34. Installation Guides • NLTK – http://www.nltk.org/install.html – http://www.nltk.org/data.html • Gensim – http://radimrehurek.com/gensim/install.html • Palito – http://ccs.dlsu.edu.ph:8086/Palito/find_project.js p 34
  • 35. Using iPython • http://ipython.org/install.html >>> documents = ["Human machine interface for lab abc computer applications", >>> "A survey of user opinion of computer system response time", >>> "The EPS user interface management system", >>> "System and human system engineering testing of EPS", >>> "Relation of user perceived response time to error measurement", >>> "The generation of random binary unordered trees", >>> "The intersection graph of paths in trees", >>> "Graph minors IV Widths of trees and well quasi ordering", >>> "Graph minors A survey"] 35
  • 36. References • Natural Language Processing with Python By Steven Bird, Ewan Klein, Edward Loper • http://www.nltk.org/book/ • http://radimrehurek.com/gensim/tutorial.htm l 36
  • 37. Thank You! • For questions and comments: - ann at auberonsolutions dot com 37