Natural Language Toolkit [NLTK] 
Prakash B Pimpale 
pbpimpale@gmail.com 
@ FOSS (From the Open Source Shelf) 
An open source software seminar series 
(CC) KBCS CDAC MUMBAI
Let's start 
➢ What is it about? 
➢ NLTK – installing and using it for some 'basic NLP' 
and Text Processing tasks 
➢ Some NLP and Some python 
➢ How will we proceed? 
➢ NLTK introduction 
➢ NLP introduction 
➢ NLTK installation 
➢ Playing with text using NLTK 
➢ Conclusion
NLTK 
➢ Open source python modules and linguistic 
data for Natural Language Processing 
applications 
➢ Developed by a group of people; project leaders: 
Steven Bird, Edward Loper, Ewan Klein 
➢ Python: simple, free and open source, portable, 
object oriented, interpreted
NLP 
➢ Natural Language Processing: a field of computer science 
and linguistics that works out interactions between computers 
and human (natural) languages. 
➢ some brand ambassadors: machine translation, 
automatic summarization, information extraction, 
transliteration, question answering, opinion mining 
➢ Basic tasks: stemming, POS tagging, chunking, parsing, 
etc. 
➢ Involves everything from simple frequency counts to understanding and 
generating complex text
Terms/Steps in NLP 
➢ Tokenization: getting words and punctuations 
out from text 
➢ Stemming: getting the (inflectional) root of a word; 
plays, playing, played : play 
cont.. 
POS(Part of Speech) tagging: 
Ram NNP 
killed VBD 
Ravana NNP 
POS tags: NNP - proper noun, 
VBD - verb, past tense 
(Penn Treebank tagset)
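NLTK ships a pre-trained tagger, so the same sentence can be tagged in a couple of lines; a minimal sketch, assuming the tagger models have already been fetched via nltk.download(): 
>>> import nltk 
>>> nltk.pos_tag(nltk.word_tokenize("Ram killed Ravana")) 
which should return word/tag pairs along the lines of ('Ram', 'NNP'), ('killed', 'VBD'), ('Ravana', 'NNP')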
cont.. 
➢ Chunking: groups words with similar POS tags into 
phrases (a liberal definition); see the sketch below
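A minimal chunking sketch using nltk.RegexpParser; the noun-phrase tag pattern here is just an illustrative assumption: 
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"   # chunk determiner + adjectives + noun into an NP 
>>> cp = nltk.RegexpParser(grammar) 
>>> tagged = [('Mary', 'NNP'), ('saw', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('dog', 'NN')] 
>>> print cp.parse(tagged) 
This groups 'a little dog' into a single NP chunk.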
cont.. 
Parsing: Identifying grammatical structures in a 
sentence: Mary saw a dog 
(S 
  (NP Mary) 
  (VP 
    (V saw) 
    (NP 
      (Det a) 
      (N dog)))) 
Applications 
➢ Machine Translation 
MaTra, Google translate 
➢ Transliteration 
Xlit, Google Indic 
➢ Text classification 
Spam filters 
many more.......!
Challenges 
➢ Word sense disambiguation 
Bank(financial/river) 
Plant(industrial/natural) 
➢ Named entity recognition 
CDAC is in Mumbai 
➢ Clause Boundary detection 
Ram who was the king of Ayodhya killed 
Ravana 
and many more.....!
Installing NLTK 
Windows 
➢ Install Python: Version 2.4.*, 2.5.*, or 2.6.* and 
not 3.0 
available at http://www.nltk.org/download 
[great instructions!] 
➢ Install PyYAML: 
➢ Install Numpy 
➢ Install Matplotlib 
➢ Install NLTK
cont.. 
Linux/Unix 
➢ Most likely python will be there (if not, use the 
package manager) 
➢ Get the python-numpy and python-scipy packages 
(scientific libraries for multidimensional arrays and 
linear algebra); use the package manager to install, 
or 
'sudo apt-get install python-numpy python-scipy'
cont.. 
NLTK Source Installation 
➢ Download the NLTK source 
( http://nltk.googlecode.com/files/nltk-2.0b8.zip ) 
➢ Unzip it 
➢ Go to the newly unzipped folder 
➢ Just do it! 
'sudo python setup.py install' 
Done :-) !
Installing Data 
➢ We need data to experiment with 
It's quite simple (same for all platforms)... 
$ python # (or IDLE) to enter the python prompt 
>>> import nltk 
>>> nltk.download() 
select any one option that you like/need 
cont.. 
Test installation 
(set the path if needed, e.g. 
export NLTK_DATA='/home/....../nltk_data') 
>>> from nltk.corpus import brown 
>>> brown.words()[0:20] 
['The', 'Fulton', 'County', 'Grand', 'Jury'] 
All set!
Playing with text 
Feeling python 
$python 
>>>print 'Hello world' 
>>>2+3+4 
Okay !
cont.. 
Data from book: 
>>>from nltk.book import * 
see what it is: 
>>>text1 
Search the word: 
>>> text1.concordance("monstrous") 
Find similar words 
>>> text1.similar("monstrous")
cont.. 
Find common context 
>>>text2.common_contexts(["monstrous", "very"]) 
Find position of the word in entire text 
>>>text4.dispersion_plot(["citizens", "democracy", 
"freedom", "duties", "America"]) 
Counting vocab 
>>>len(text4) # gives tokens (words and punctuation) 
cont.. 
Vocabulary(unique words) used by author 
>>>len(set(text1)) 
See sorted vocabulary(upper case first!) 
>>>sorted(set(text1)) 
Measuring richness of text(average repetition) 
>>>len(text1)/len(set(text1)) 
(before dividing, import floating point division: 
from __future__ import division) 
cont.. 
Counting occurrences 
>>>text5.count("lol") 
% of the text taken 
>>> 100 * text5.count('a') / len(text5) 
let's write a function for it in python 
>>> def prcntg_oftext(word, text): 
        return 100 * text.count(word) / len(text)
cont.. 
Writing our own text 
>>>textown =['a','text','is','a','list','of','words'] 
adding texts(lists, can contain anything) 
>>>sent = ['pqr'] 
>>>sent2 = ['xyz'] 
>>>sent+sent2 
append to text 
>>>sent.append('abc') 
cont.. 
Indexing text 
>>> text4[173] 
or 
>>>text4.index('awaken') 
slicing 
text1[3:4], text1[:5], text1[10:] (try it! notice the ending 
index is one less than specified)
cont.. 
Strings in python 
>>>name = 'astring' 
>>>name[0] 
what's more 
>>>name*2 
>>>'SPACE '.join(['let us','join'])
cont.. 
Statistics on text 
Frequency Distribution 
>>> fdist1 = FreqDist(text1) 
>>> fdist1 
>>> vocabulary1 = fdist1.keys() 
>>>vocabulary1[:25] 
let's plot it... 
>>>fdist1.plot(50, cumulative=True)
cont.. 
* what is text1 about? 
* are all the words meaningful? 
Let's find long words i.e. 
Words from text1 with length more than 10 
in python 
>>>longw=[word for word in text1 if len(word)>10] 
let's see it... 
>>>sorted(longw)
cont.. 
* most long words have low frequency and are often 
not the main topic of the text 
* long words with a certain frequency may help 
identify the text's topic 
>>>sorted([w for w in set(text5) if len(w) > 7 and 
fdist5[w] > 7])
cont.. 
Bigrams and collocations: 
*Bigrams 
>>>sent=['near','river', 'bank'] 
>>>bigrams(sent) 
basic to many NLP apps; transliteration, 
translation 
*collocations: frequent bigrams of rare words 
>>>text4.collocations() 
basic to some WSD approaches 
cont.. 
FreqDist functions: 
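A quick sketch of a few commonly used FreqDist methods, run on the distribution built earlier: 
>>> fdist1['whale']       # count of a particular word 
>>> fdist1.max()          # the most frequent token 
>>> fdist1.freq('whale')  # relative frequency 
>>> fdist1.N()            # total number of tokens counted 
>>> fdist1.hapaxes()[:10] # words that occur only once 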
cont.. 
Python: frequently needed 'word' (string) functions 
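A quick sketch of the plain-Python string methods that keep coming up when handling words: 
>>> s = 'Monty Python' 
>>> s.lower(), s.upper()      # case conversion 
>>> s.startswith('Mon')       # test the beginning of a string 
>>> s.endswith('thon')        # test the end of a string 
>>> 'Py' in s                 # substring test 
>>> s.split()                 # split on whitespace 
>>> s.replace('y', 'i')       # replace characters 
>>> 'word'.isalpha(), '123'.isdigit()   # character-class tests 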
cont.. 
Bigger problems: 
WSD, Anaphora resolution, Machine translation 
try and laugh: 
>>>babelize_shell() 
>>>german 
>>>run 
or try it here: 
http://translationparty.com/#6917644
cont.. 
Playing with text corpora 
Available corpora: Gutenberg, Web and chat 
text, Brown, Reuters, inaugural addresses, etc. 
>>> from nltk.corpus import inaugural 
>>> inaugural.fileids() 
>>> [fileid[:4] for fileid in inaugural.fileids()] 
cont.. 
>>> cfd = nltk.ConditionalFreqDist( 
        (target, fileid[:4]) 
        for fileid in inaugural.fileids() 
        for w in inaugural.words(fileid) 
        for target in ['america', 'citizen'] 
        if w.lower().startswith(target)) 
>>> cfd.plot() 
Conditional frequency distribution of the words 'america' and 
'citizen' across the different years
cont.. 
Generating text (applying bigrams) 
>>> def generate_model(cfdist, word, num=15): 
        for i in range(num): 
            print word, 
            word = cfdist[word].max() 
>>> text = nltk.corpus.genesis.words('english-kjv.txt') 
>>> bigrams = nltk.bigrams(text) 
>>> cfd = nltk.ConditionalFreqDist(bigrams) 
>>> print cfd['living'] 
>>> generate_model(cfd, 'living')
cont.. 
WordList corpora 
>>> def unusual_words(text): 
        text_vocab = set(w.lower() for w in text if w.isalpha()) 
        english_vocab = set(w.lower() for w in nltk.corpus.words.words()) 
        unusual = text_vocab.difference(english_vocab) 
        return sorted(unusual) 
>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')) 
These are the unusual or misspelled words.
cont.. 
Stop words: 
>>> from nltk.corpus import stopwords 
>>> stopwords.words('english') 
Percentage of content: 
>>> def content_fraction(text): 
        stopwords = nltk.corpus.stopwords.words('english') 
        content = [w for w in text if w.lower() not in stopwords] 
        return len(content) / len(text) 
>>> content_fraction(nltk.corpus.reuters.words()) 
That is the fraction of useful (content) words!
cont.. 
WordNet: a structured, semantically oriented 
dictionary 
>>> from nltk.corpus import wordnet as wn 
get synset/collection of synonyms 
>>> wn.synsets('motorcar') 
now get synonyms(lemmas) 
>>> wn.synset('car.n.01').lemma_names
cont.. 
What does car.n.01 actually mean? 
>>> wn.synset('car.n.01').definition 
An example: 
>>> wn.synset('car.n.01').examples 
Also 
>>>wn.lemma('supply.n.02.supply').antonyms() 
Useful data in many applications like WSD, 
understanding text, question answering, etc.
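The synset graph can also be navigated and used for similarity scoring, which several WSD approaches build on; a small sketch: 
>>> wn.synset('car.n.01').hypernyms()      # more general concepts 
>>> wn.synset('car.n.01').hyponyms()[:5]   # more specific concepts 
>>> wn.synset('car.n.01').path_similarity(wn.synset('truck.n.01')) 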
cont.. 
Text from web: 
>>> from urllib import urlopen 
>>> url = 
"http://www.gutenberg.org/files/2554/2554.txt" 
>>> raw = urlopen(url).read() 
>>> type(raw) 
>>> len(raw) 
the length is in characters...!
cont.. 
Tokenization: 
>>> tokens = nltk.word_tokenize(raw) 
>>> type(tokens) 
>>> len(tokens) 
>>> tokens[:15] 
convert to Text 
>>> text = nltk.Text(tokens) 
>>> type(text) 
Now you can do all above things with this!
cont.. 
What about HTML? 
>>> url = 
"http://news.bbc.co.uk/2/hi/health/2284783.stm" 
>>> html = urlopen(url).read() 
>>> html[:60] 
It's HTML; Clean the tags... 
>>> raw = nltk.clean_html(html) 
>>> tokens = nltk.word_tokenize(raw) 
>>> tokens[:60] 
>>> text = nltk.Text(tokens) 
cont.. 
Try it yourself 
Processing RSS feeds using universal feed parser ( 
http://feedparser.org/) 
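A minimal sketch, assuming the feedparser package is installed and using an example blog feed; the exact entry fields depend on the feed format: 
>>> import feedparser 
>>> feed = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom") 
>>> feed['feed']['title'] 
>>> len(feed.entries) 
>>> post = feed.entries[0] 
>>> post.title 
>>> raw = nltk.clean_html(post.content[0].value)   # for an atom feed carrying HTML content 
>>> nltk.word_tokenize(raw)[:20] 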
Opening text from local machine: 
>>> f = open('document.txt') 
>>> raw = f.read() 
(make sure the file is in your current directory) 
then tokenize and bring to Text and play with it
cont.. 
Capture user input: 
>>> s = raw_input("Enter some text: ") 
>>> print "You typed", len(nltk.word_tokenize(s)), 
"words." 
Writing output to files 
>>> output_file = open('output.txt', 'w') 
>>> words = set(nltk.corpus.genesis.words('english-kjv.txt')) 
>>> for word in sorted(words): 
        output_file.write(word + "\n")
cont.. 
Stemming : NLTK has inbuilt stemmers 
>>>text =['apples','called','indians','applying'] 
>>> porter = nltk.PorterStemmer() 
>>> lancaster = nltk.LancasterStemmer() 
>>>[porter.stem(p) for p in text] 
>>>[lancaster.stem(p) for p in text] 
Lemmatization: the WordNet lemmatizer removes 
affixes only if the resulting word exists in the dictionary 
>>> wnl = nltk.WordNetLemmatizer() 
>>> [wnl.lemmatize(t) for t in text]
cont.. 
Using a POS tagger: 
>>> text = nltk.word_tokenize("And now for something completely different") 
>>> nltk.pos_tag(text) 
Query the POS tag documentation as: 
>>>nltk.help.upenn_tagset('NNP') 
Parsing with NLTK: 
an example (see the sketch below)
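A minimal sketch for the 'Mary saw a dog' sentence from the earlier slide, using the grammar-and-parser names of the NLTK 2.x releases this talk targets (newer NLTK versions spell these nltk.CFG.fromstring and parser.parse): 
>>> grammar = nltk.parse_cfg(""" 
    S -> NP VP 
    VP -> V NP 
    NP -> 'Mary' | Det N 
    Det -> 'a' 
    N -> 'dog' 
    V -> 'saw' 
    """) 
>>> rd_parser = nltk.RecursiveDescentParser(grammar) 
>>> for tree in rd_parser.nbest_parse(['Mary', 'saw', 'a', 'dog']): 
        print tree 
This should print the bracketed (S (NP Mary) (VP (V saw) (NP (Det a) (N dog)))) tree shown earlier.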
cont.. 
Classifying with NLTK: 
Problem: Gender identification from name 
Feature: the last letter 
>>> def gender_features(word): 
        return {'last_letter': word[-1]} 
Get data 
>>> from nltk.corpus import names 
>>> import random
cont.. 
>>> names = ([(name, 'male') for name in 
names.words('male.txt')] + 
[(name, 'female') for name in 
names.words('female.txt')]) 
>>> random.shuffle(names) 
Get Feature set 
>>> featuresets = [(gender_features(n), g) for (n,g) in 
names] 
>>> train_set, test_set = featuresets[500:], featuresets[:500] 
Train 
>>> classifier = nltk.NaiveBayesClassifier.train(train_set) 
Test 
>>> classifier.classify(gender_features('Neo'))
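Not on the slide, but the natural next step is to score the classifier on the held-out test set with NLTK's accuracy helper: 
>>> print nltk.classify.accuracy(classifier, test_set) 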
Conclusion 
NLTK is a rich resource for starting with Natural 
Language Processing, and its being open 
source and based on Python adds feathers to 
its cap!
References 
➢ http://www.nltk.org/ 
➢ http://www.python.org/ 
➢ http://wikipedia.org/ 
(all last accessed on 19th March 2010) 
➢ Natural Language Processing with Python, Steven Bird, Ewan Klein, and 
Edward Loper, O'Reilly
Thank you 
#More on NLP @ http://matra2.blogspot.com/ 
#Slides were prepared to 'support' the talk and are distributed as-is under CC 3.0 
(CC) KBCS CDAC MUMBAI
