Introduction to NLTK

Getting Started with NLTK
An Introduction to NLTK

Sreejith S
srssreejith@gmail.com
@tweet2sree

FOSSMeet 2011, NIC Calicut

06 February 2011


Just a word about me

Working in Natural Language Processing (NLP), Machine Learning and
Text Mining
Active member of ilugcbe, http://ilugcbe.techstud.org
Works for 365Media Pvt. Ltd., Coimbatore, India
@tweet2sree, srssreejith@gmail.com


Introduction - NLP

Natural Language Processing (NLP) is an interdisciplinary subject, drawing on
Computer Science
Linguistics
Statistics, etc.

NLP is a subfield of Artificial Intelligence
NLP is any kind of computer manipulation of natural language
It is a rapidly developing field of study
Everyday applications of NLP: handwriting recognition, machine translation,
question-answering systems, spell checkers, grammar checkers, etc.

Natural Language Toolkit (NLTK)

A collection of Python programs, modules, data sets and tutorials to
support research and development in Natural Language Processing
(NLP)
Written by Steven Bird, Edward Loper and Ewan Klein

NLTK is
Free and open source
Easy to use
Modular
Well documented
Simple and extensible
http://www.nltk.org


What You Will Learn

How simple programs can help you manipulate and analyze language
data, and how to write these programs
How key concepts from NLP and linguistics are used to describe and
analyze language
How data structures and algorithms are used in NLP
How language data is stored in standard formats, and how data can
be used to evaluate the performance of NLP techniques


Installation of NLTK

Make sure that Python 2.4, 2.5 or 2.6 is available on your system
Install the Python Tkinter package
Install NumPy, Matplotlib, Prover9, MaltParser and MegaM
Download NLTK and install it
If you are installing NLTK from source, download
http://nltk.googlecode.com/files/nltk-2.0b9.zip
Unzip it; this creates the folder nltk-2.0b9
Open a terminal, cd into this folder and, as superuser, run
python setup.py install

To install the data, start the Python interpreter
>>> import nltk
>>> nltk.download()
Now you are ready to play with NLTK!


NLTK Modules

NLTK Modules                  Functionality
nltk.corpus                   Corpus
nltk.tokenize, nltk.stem      Tokenizers, stemmers
nltk.collocations             t-test, chi-squared, mutual information
nltk.tag                      n-gram, backoff, Brill, HMM, TnT
nltk.classify, nltk.cluster   Decision tree, naive Bayes, k-means
nltk.chunk                    Regex, n-gram, named entity
nltk.parse                    Parsing
nltk.sem, nltk.inference      Semantic interpretation
nltk.metrics                  Evaluation metrics
nltk.probability              Probability & estimation
nltk.app, nltk.chat           Applications


Let us start the game

To access the data used in the examples from the book
>>> from nltk.book import *

Some basic workouts from the book

Concordance
>>> text1.concordance("monstrous")
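A concordance lists every occurrence of a word together with its surrounding context. The plain-Python sketch below illustrates the idea on a made-up token list. It is not NLTK's implementation; the helper name `concordance` and its `width` parameter are assumptions for illustration only.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines

# Made-up sample text, not taken from the NLTK book corpora.
sample = "the monstrous whale rose and the crew saw a most monstrous size".split()
for line in concordance(sample, "monstrous"):
    print(line)
# the monstrous whale rose and
# saw a most monstrous size
```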



Similar
>>> text1.similar("monstrous")



Dispersion plot - positional information
>>> text4.dispersion_plot(["citizens",
... "democracy", "freedom", "duties", "America"])

>>> text4.dispersion_plot(["and",
... "to", "of", "with", "the"])
What does this plot show? Why?
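A dispersion plot is built from positional information: for each word, the offsets at which it occurs in the text. The sketch below computes those offsets in plain Python for a made-up sentence. The helper name `word_offsets` is hypothetical. Very common words such as "the" tend to produce offsets spread across the whole text, which is one plausible reading of the second plot.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = dict((t, []) for t in targets)
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

# Made-up sample sentence, not the inaugural address corpus.
tokens = "we the people of the united states hold that freedom and duties go together".split()
positions = word_offsets(tokens, ["the", "freedom", "duties"])
for word in ["the", "freedom", "duties"]:
    print("%s %s" % (word, positions[word]))
# the [1, 4]
# freedom [9]
# duties [11]
```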


Continued...

Generate
>>> text3.generate()

Counting Vocabulary
>>> len(text3)
List of distinct words, sorted in dictionary order
>>> sorted(set(text3))
Count occurrences of a particular word in a text
>>> text3.count("and")
What percentage of the text is taken up by a specific word
>>> 100.0 * text3.count("and") / len(text3)
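The same counts can be tried on an ordinary list of tokens, without the book data. This is a small sketch on a made-up token list. Note the float in the percentage: under Python 2, dividing two integers with `/` discards the fractional part.

```python
# Made-up token list standing in for a text such as text3.
tokens = "in the beginning god created the heaven and the earth and the earth was without form".split()

total = len(tokens)                      # number of tokens, like len(text3)
vocabulary = sorted(set(tokens))         # distinct words in dictionary order
count_and = tokens.count("and")          # occurrences of one particular word
percentage = 100.0 * count_and / total   # float arithmetic avoids integer division

print(total)         # 16
print(vocabulary)
print(count_and)     # 2
print(percentage)    # 12.5
```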


Collocation & Bigram

Collocation
A collocation is a sequence of words that occur together unusually often,
e.g. red wine, strong tea
But "strong computer" is not a collocation
>>> text4.collocations()
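"Unusually often" can be made concrete with an association measure. The modules table lists mutual information as one of the measures in nltk.collocations. The sketch below scores adjacent word pairs by pointwise mutual information (PMI) in plain Python. It is an illustration only and not necessarily the measure that `text4.collocations()` uses; the helper name `bigram_pmi` and the sample text are assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Score each adjacent word pair by pointwise mutual information (base 2)."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n_words = float(len(tokens))
    n_pairs = float(len(tokens) - 1)
    scores = {}
    for (w1, w2), joint in pairs.items():
        p_joint = joint / n_pairs
        p_w1 = unigrams[w1] / n_words
        p_w2 = unigrams[w2] / n_words
        # PMI compares the observed pair probability with what independence predicts.
        scores[(w1, w2)] = math.log(p_joint / (p_w1 * p_w2), 2)
    return scores

# Made-up sample text.
tokens = "red wine and strong tea and the red wine and the strong tea".split()
scores = bigram_pmi(tokens)
for pair in sorted(scores, key=scores.get, reverse=True):
    print("%s %.2f" % (" ".join(pair), scores[pair]))
```

In this sample "red wine" and "strong tea" score higher than a pair such as "and strong", because the words in the first two pairs rarely occur apart.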



Bigrams
List of word pairs
>>> text = "sreejith is talking about NLTK"
>>> wordlist = text.split()
>>> bigrams(wordlist)
What will happen if I do this?
>>> bigrams(text)
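The difference between the two calls can be seen with a plain-Python stand-in for `bigrams`. This is a sketch, not NLTK's own code. A string is a sequence of characters, so pairing its elements gives character pairs instead of word pairs. In newer NLTK releases `bigrams` returns a generator, so wrap it in `list()` to see the pairs.

```python
def bigrams(sequence):
    """Return the list of adjacent pairs in any sequence (stand-in for nltk.bigrams)."""
    items = list(sequence)
    return list(zip(items, items[1:]))

text = "sreejith is talking about NLTK"
word_pairs = bigrams(text.split())   # pairs of words
char_pairs = bigrams(text)           # pairs of characters, spaces included

print(word_pairs)
# [('sreejith', 'is'), ('is', 'talking'), ('talking', 'about'), ('about', 'NLTK')]
print(char_pairs[:3])
# [('s', 'r'), ('r', 'e'), ('e', 'e')]
```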


Work with our own data

Populate our own corpora with NLTK and analyse them
>>> from nltk.corpus import PlaintextCorpusReader as ptr
>>> corpus = '/home/developer/Desktop/Sreejith'
>>> wordlist = ptr(corpus, '.*')
>>> wordlist.fileids()

Let us find out how to count the number of characters, words
and sentences in the corpus
>>> for fid in wordlist.fileids():
...     print len(wordlist.raw(fid))
...     print len(wordlist.words(fid))
...     print len(wordlist.sents(fid))
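To see roughly what these three counts measure, here is a standard-library sketch that walks a directory of text files. It is an approximation under stated assumptions: words are split on whitespace and sentences on end punctuation, whereas the NLTK corpus reader uses its own tokenizers, so the numbers will generally differ. The helper name `corpus_stats` and the demo file are hypothetical.

```python
import os
import re
import tempfile

def corpus_stats(directory):
    """Return {filename: (characters, words, sentences)} for every file in a directory."""
    stats = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path) as handle:
            raw = handle.read()
        words = raw.split()
        # Naive sentence split: break after ., ! or ? when followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]
        stats[name] = (len(raw), len(words), len(sentences))
    return stats

# Hypothetical one-file corpus written to a temporary directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "talk.txt"), "w") as handle:
    handle.write("NLTK is fun. It is written in Python.")

print(corpus_stats(demo_dir))
# {'talk.txt': (37, 8, 2)}
```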


Continued...

Plotting a conditional frequency distribution (CFD)
>>> words = text.split()
>>> big = bigrams(words)
>>> gd = nltk.ConditionalFreqDist(big)
>>> gd.plot()
Tabulate the CFD
>>> gd.tabulate()
Plot a frequency distribution
>>> fdist = FreqDist(text1)
>>> fdist.plot(50, cumulative=True)
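A conditional frequency distribution built from bigrams keeps one frequency count per first word (the condition), recording which words follow it. The sketch below reproduces that structure with the standard library on a made-up sentence. It is only an illustration of the data structure, not NLTK's class; the helper name `conditional_freq` is an assumption.

```python
from collections import Counter, defaultdict

def conditional_freq(pairs):
    """Count outcomes per condition from (condition, outcome) pairs."""
    table = defaultdict(Counter)
    for condition, outcome in pairs:
        table[condition][outcome] += 1
    return table

# Made-up sample sentence.
words = "the dog saw the cat and the dog ran".split()
cfd = conditional_freq(zip(words, words[1:]))

print(cfd["the"].most_common())
# [('dog', 2), ('cat', 1)]
print(cfd["dog"]["ran"])
# 1
```

Tabulating such a structure amounts to printing one row per condition and one column per outcome.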


Normalizing Text

Stemming
Stemming is the process of reducing inflected (or sometimes derived)
words to their stem, base or root form, generally a written word form
>>> porter = nltk.PorterStemmer()
>>> word = 'running'
>>> porter.stem(word)

>>> lancaster = nltk.LancasterStemmer()
>>> lancaster.stem(word)

Lemmatization
Stemming, plus making sure that the resulting form is a known word in a
dictionary
>>> wnl = nltk.WordNetLemmatizer()
>>> wnl.lemmatize(word)
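The contrast between the two ideas can be shown with a toy example. The sketch below is deliberately naive: it strips a few suffixes, which is not the Porter or Lancaster algorithm, and it checks the result against a tiny made-up word list, which is not WordNet. All names in it are hypothetical.

```python
SUFFIXES = ("ing", "ed", "es", "s")            # toy suffix list, not the Porter rules
KNOWN_WORDS = set(["run", "walk", "dog", "box"])  # hypothetical mini dictionary

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Accept the stemmed form only if it is a known dictionary word."""
    candidate = toy_stem(word)
    return candidate if candidate in KNOWN_WORDS else word

for w in ["walked", "boxes", "running"]:
    print("%s %s %s" % (w, toy_stem(w), toy_lemmatize(w)))
# walked walk walk
# boxes box box
# running runn running
```

The last row shows the point of the dictionary check: the toy stemmer produces "runn", which is not a word, so the toy lemmatizer leaves "running" unchanged.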


POS Tagging

The process of classifying words into their parts of speech and labeling
them accordingly is known as part-of-speech tagging, or POS tagging
>>> text = nltk.word_tokenize("we are attending FOSS meet at NIC calicut")
>>> nltk.pos_tag(text)
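The output of a tagger is a list of (word, tag) pairs. The sketch below shows that shape with a toy lookup tagger that falls back to a default tag for unknown words, a much simpler idea than the trained tagger behind `nltk.pos_tag`. The lexicon, the helper name `lookup_tag` and the default tag are assumptions for illustration; the tags themselves are Penn Treebank tags.

```python
# Hypothetical mini lexicon mapping lowercase words to Penn Treebank tags.
LEXICON = {"we": "PRP", "are": "VBP", "attending": "VBG", "at": "IN"}

def lookup_tag(tokens, lexicon, default="NN"):
    """Tag each token from the lexicon, backing off to `default` for unknown words."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]

tokens = "we are attending FOSS meet at NIC calicut".split()
tagged = lookup_tag(tokens, LEXICON)
print(tagged)
```

Every word missing from the lexicon is tagged as a noun here, which is the kind of fallback a backoff tagger chain ends with.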


Parsing

Sentence Parsing
Analyzing sentence structure and creating a parse tree

>>> sentence = [("the", "DT"), ("little", "JJ"),
... ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"),
... ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print result
>>> result.draw()
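The grammar rule reads: an NP chunk is an optional determiner, any number of adjectives, then a noun. The sketch below hand-codes that one pattern in plain Python to show what the chunker groups together. It is not how `nltk.RegexpParser` is implemented; the helper name `chunk_noun_phrases` and the flat output format are assumptions.

```python
def chunk_noun_phrases(tagged):
    """Group tokens matching <DT>?<JJ>*<NN> into NP chunks; leave the rest as-is."""
    result = []
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                             # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":      # any number of adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":         # the noun closes the chunk
            result.append(("NP", tagged[i:j + 1]))
            i = j + 1
        else:
            result.append(tagged[i])                         # no chunk starts here
            i += 1
    return result

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
for item in chunk_noun_phrases(sentence):
    print(item)
```

For this sentence the result holds two NP chunks, "the little yellow dog" and "the cat", with "barked" and "at" left outside any chunk.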


Machine Translation

Babelizer Shell
Translating a sentence from its source language to a specified language.
NLTK provides the babelize shell
>>> babelize_shell()
Babel> hello how are you?
Babel> german
Babel> run

Also try Google Translate and Yahoo Babel Fish


What can you do?

Contribute to NLTK
GSOC
NLP Training
Real time research


Reference

Steven Bird, Edward Loper and Ewan Klein
Natural Language Processing with Python
Jacob Perkins
Python Text Processing with NLTK 2.0 Cookbook
http://www.nltk.org


Questions


And finally...

Sreejith.S

