5. >>> inputstring = ' This is an example sent. The sentence splitter will split on sent markers. Ohh really !!'
>>> from nltk.tokenize import sent_tokenize
>>> all_sent = sent_tokenize(inputstring)
>>> print(all_sent)
['This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !!']
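Behind the scenes, sent_tokenize relies on NLTK's pre-trained Punkt model. To make the underlying idea concrete, here is a minimal rule-based sketch (the naive_split helper below is illustrative, not part of NLTK): it cuts after sentence-final punctuation, whereas Punkt also learns abbreviations, so it will not wrongly split after strings like "Mr.".

```python
import re

# Naive splitter: cut after '.', '!' or '?' followed by whitespace.
# Punkt (used by sent_tokenize) additionally learns abbreviations,
# so unlike this sketch it will not split after "Mr." or "e.g.".
def naive_split(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(naive_split("This is an example sent. The sentence "
                  "splitter will split on sent markers. Ohh really !!"))
# -> ['This is an example sent.',
#     'The sentence splitter will split on sent markers.',
#     'Ohh really !!']
```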
13. Snowball stemmers are available for Dutch, English, French, German, Italian, Portuguese, Romanian, Russian, and other languages.
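Assuming NLTK is installed (no corpus download is needed for the stemmer), the full set of supported languages can be listed at run time via the languages attribute of SnowballStemmer:

```python
from nltk.stem import SnowballStemmer

# Tuple of language names the Snowball stemmer supports
print(SnowballStemmer.languages)
```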
14. The Snowball stemmer, which is based on the Snowball stemming algorithm, can be used in NLTK like this:
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
'maximum'
>>> snowball_stemmer.stem('presumably')
'presum'
>>> snowball_stemmer.stem('multiply')
'multipli'
>>> snowball_stemmer.stem('provision')
'provis'
>>> snowball_stemmer.stem('owed')
'owe'
>>> snowball_stemmer.stem('ear')
'ear'
18. >>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english') # pass the language name
# NLTK supports 22 languages for removing the stop words
>>> text = "This is just a test"
>>> cleanwordlist = [word for word in text.lower().split() if word not in stoplist]
>>> print(cleanwordlist)
# every word except 'test' is in the stop list
['test']
StopWords
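The filtering pattern above works with any word list; here is a self-contained sketch using a tiny hand-written stop list (an illustrative subset, much smaller than NLTK's real list):

```python
# Tiny illustrative stop list -- NLTK's English list is far longer.
STOPLIST = {"this", "is", "just", "a", "the", "of", "and"}

def remove_stopwords(text):
    # Lowercase before comparing, since the stop list is lowercase.
    return [w for w in text.lower().split() if w not in STOPLIST]

print(remove_stopwords("This is just a test"))   # -> ['test']
```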
19. >>> # tokens is a list of all tokens in the corpus
>>> freq_dist = nltk.FreqDist(tokens)
>>> rarewords = [word for word, count in freq_dist.most_common()[-50:]]
>>> after_rare_words = [word for word in tokens if word not in rarewords]
Rarewords
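The same rare-word cut can be done with only the standard library. A sketch using collections.Counter on a toy corpus, dropping words that occur only once (what NLTK's FreqDist calls hapaxes) rather than a fixed tail of 50:

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "zyzzyva"]
freq = Counter(tokens)

# Words occurring only once -- FreqDist exposes these as .hapaxes()
rare = {w for w, c in freq.items() if c == 1}
after_rare_words = [w for w in tokens if w not in rare]

print(after_rare_words)   # -> ['the', 'the', 'the']
```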
21. >>> import nltk
>>> from nltk import word_tokenize
>>> s = "I was watching TV"
>>> print(nltk.pos_tag(word_tokenize(s)))
[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
What is Part of speech tagging
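NLTK's pos_tag uses a trained perceptron tagger. As a purely illustrative contrast, a toy rule-based tagger (every rule below is made up for this sketch) shows the kind of per-token decision a tagger makes:

```python
# Toy rule-based tagger -- illustrative only; real taggers are trained
# on annotated corpora (nltk.pos_tag uses an averaged perceptron).
def toy_tag(word):
    if word.endswith("ing"):
        return "VBG"          # gerund / present participle
    if word in {"I", "you", "he", "she", "it", "we", "they"}:
        return "PRP"          # personal pronoun
    if word in {"was", "were", "did"}:
        return "VBD"          # past-tense verb
    return "NN"               # default to noun

print([(w, toy_tag(w)) for w in "I was watching TV".split()])
# -> [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
```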