The document discusses the Natural Language Toolkit (NLTK), an open source Python library for natural language processing. It provides an overview of NLTK and natural language processing (NLP), describes how to install NLTK, and demonstrates how to perform basic NLP tasks like tokenization, stemming, part-of-speech tagging, and frequency distributions using NLTK on sample texts. The document is intended as an introduction and tutorial for using NLTK for basic text analysis and natural language processing in Python.
1. Natural Language Toolkit [NLTK]
Prakash B Pimpale
pbpimpale@gmail.com
@ FOSS (From the Open Source Shelf)
An open source software seminar series
(CC) KBCS CDAC MUMBAI
2. Let's start
➢ What is it about?
➢ NLTK – installing and using it for some 'basic NLP' and text processing tasks
➢ Some NLP and some Python
➢ How will we proceed?
➢ NLTK introduction
➢ NLP introduction
➢ NLTK installation
➢ Playing with text using NLTK
➢ Conclusion
3. NLTK
➢ Open source Python modules and linguistic data for Natural Language Processing applications
➢ Developed by a group of people; project leaders: Steven Bird, Edward Loper, Ewan Klein
➢ Python: simple, free and open source, portable, object oriented, interpreted
4. NLP
➢ Natural Language Processing: the field of computer science and linguistics that works out interactions between computers and human (natural) languages.
➢ Some brand ambassadors: machine translation, automatic summarization, information extraction, transliteration, question answering, opinion mining
➢ Basic tasks: stemming, POS tagging, chunking, parsing, etc.
➢ Involves everything from simple frequency counts to understanding and generating complex text
5. Terms/Steps in NLP
➢ Tokenization: getting words and punctuation out from text
➢ Stemming: getting the (inflectional) root of a word; plays, playing, played : play
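A quick taste of both in NLTK (a minimal sketch; it assumes NLTK and its data are already installed, as covered later in these slides):
>>> import nltk
>>> tokens = nltk.word_tokenize("She plays; he played.")
>>> tokens
['She', 'plays', ';', 'he', 'played', '.']
>>> porter = nltk.PorterStemmer()
>>> [porter.stem(t) for t in tokens] # 'plays' and 'played' both reduce to 'play'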
6. cont..
POS (Part of Speech) tagging:
Ram NNP
killed VBD
Ravana NNP
POS tags: NNP - proper noun; VBD - verb, past tense
(Penn treebank tagset)
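NLTK can produce this tagging directly (a sketch; the exact tags depend on the tagger model bundled with your NLTK data):
>>> nltk.pos_tag(nltk.word_tokenize('Ram killed Ravana'))
[('Ram', 'NNP'), ('killed', 'VBD'), ('Ravana', 'NNP')]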
7. cont..
➢ Chunking: groups the similar POS tags (a liberal definition)
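A minimal noun-phrase chunker sketch; the regular-expression grammar here is an illustrative assumption, not from the slides:
>>> grammar = 'NP: {<DT>?<JJ>*<NN>}' # determiner + adjectives + noun
>>> cp = nltk.RegexpParser(grammar)
>>> cp.parse(nltk.pos_tag(nltk.word_tokenize('the little dog barked')))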
8. cont..
Parsing: identifying grammatical structures in a sentence: Mary saw a dog
(S
  (NP Mary)
  (VP
    (V saw)
    (NP
      (Det a)
      (N dog))))
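The same tree can be built by hand with nltk.Tree, which is handy for seeing how parsers represent their output (a small sketch using the Tree constructor):
>>> t = nltk.Tree('S', [nltk.Tree('NP', ['Mary']),
...                     nltk.Tree('VP', [nltk.Tree('V', ['saw']),
...                                      nltk.Tree('NP', [nltk.Tree('Det', ['a']),
...                                                       nltk.Tree('N', ['dog'])])])])
>>> print t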
9. Applications
➢ Machine translation: MaTra, Google Translate
➢ Transliteration: Xlit, Google Indic
➢ Text classification: spam filters
and many more.......!
10. Challenges
➢ Word sense disambiguation: bank (financial/river), plant (industrial/natural)
➢ Named entity recognition: CDAC is in Mumbai
➢ Clause boundary detection: Ram, who was the king of Ayodhya, killed Ravana
and many more.....!
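For the named-entity case, NLTK ships a default NE chunker you can try on the example above (a sketch; results depend on the bundled models):
>>> nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize('CDAC is in Mumbai')))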
11. Installing NLTK
Windows
➢ Install Python: version 2.4.*, 2.5.*, or 2.6.* (not 3.0); available at http://www.nltk.org/download [great instructions!]
➢ Install PyYAML
➢ Install Numpy
➢ Install Matplotlib
➢ Install NLTK
12. cont..
Linux/Unix
➢ Most likely Python will already be there (if not, use your package manager)
➢ Get the python-numpy and python-scipy packages (scientific libraries for multidimensional arrays and linear algebra); use your package manager to install, or:
sudo apt-get install python-numpy python-scipy
13. cont..
NLTK source installation
➢ Download the NLTK source (http://nltk.googlecode.com/files/nltk-2.0b8.zip)
➢ Unzip it
➢ Go to the newly unzipped folder
➢ Just do it!
sudo python setup.py install
Done :)
14. Installing Data
➢ We need data to experiment with
It's simple (and the same on all platforms)...
$ python # enter the Python prompt (or use IDLE)
>>> import nltk
>>> nltk.download()
Select whichever option you like/need
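If you already know what you want, the downloader also accepts an identifier instead of opening the interactive chooser; the 'book' collection used throughout these slides is one such identifier:
>>> nltk.download('book')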
16. cont..
Test the installation
(set the path if needed, e.g.
export NLTK_DATA='/home/....../nltk_data')
>>> from nltk.corpus import brown
>>> brown.words()[0:20]
['The', 'Fulton', 'County', 'Grand', 'Jury', ...]
All set!
17. Playing with text
Feeling Python
$ python
>>> print 'Hello world'
>>> 2+3+4
Okay!
18. cont..
Data from the book:
>>> from nltk.book import *
See what it is:
>>> text1
Search for a word:
>>> text1.concordance("monstrous")
Find similar words:
>>> text1.similar("monstrous")
19. cont..
Find common contexts:
>>> text2.common_contexts(["monstrous", "very"])
Find the positions of words within the entire text:
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
Counting vocabulary:
>>> len(text4) # gives tokens (words and punctuation)
20. cont..
Vocabulary (unique words) used by the author:
>>> len(set(text1))
See the sorted vocabulary (upper case first!):
>>> sorted(set(text1))
Measuring the richness of the text (average repetition):
>>> len(text1) / len(set(text1))
(before dividing, enable floating point division:
from __future__ import division)
21. cont..
Counting occurrences:
>>> text5.count("lol")
Percentage of the text taken up by a word:
>>> 100 * text5.count('a') / len(text5)
Let's write a function for it in Python:
>>> def prcntg_oftext(word, text):
...     return 100 * text.count(word) / len(text)
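Trying it out (with the corrected function above) on the chat corpus:
>>> prcntg_oftext('lol', text5)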
22. cont..
Writing our own text:
>>> textown = ['a', 'text', 'is', 'a', 'list', 'of', 'words']
Adding texts (lists; they can contain anything):
>>> sent = ['pqr']
>>> sent2 = ['xyz']
>>> sent + sent2
Appending to a text:
>>> sent.append('abc')
23. cont..
Indexing text:
>>> text4[173]
or
>>> text4.index('awaken')
Slicing:
text1[3:4], text1[:5], text1[10:] (try it! notice the ending index is one less than specified)
24. cont..
Strings in Python:
>>> name = 'astring'
>>> name[0]
What's more:
>>> name * 2
>>> 'SPACE '.join(['let us', 'join'])
25. cont..
Statistics on text
Frequency distribution:
>>> fdist1 = FreqDist(text1)
>>> fdist1
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:25]
Let's plot it...
>>> fdist1.plot(50, cumulative=True)
26. cont..
* What is text1 about?
* Are all the words meaningful?
Let's find the long words, i.e. words from text1 with length more than 10, in Python:
>>> longw = [word for word in text1 if len(word) > 10]
Let's see them...
>>> sorted(longw)
27. cont..
* Most long words have a small frequency and are often not the major topic of the text
* Words that are long and also reach a certain frequency may help to identify the text's topic:
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
28. cont..
Bigrams and collocations:
* Bigrams
>>> sent = ['near', 'river', 'bank']
>>> bigrams(sent)
Basic to many NLP apps: transliteration, translation
* Collocations: frequent bigrams of rare words
>>> text4.collocations()
Basic to some WSD approaches
31. cont..
Bigger problems:
WSD, anaphora resolution, machine translation
Try and laugh:
>>> babelize_shell()
>>> german
>>> run
or try it here: http://translationparty.com/#6917644
32. cont..
Playing with text corpora
Available corpora: Gutenberg, Web and chat text, Brown, Reuters, inaugural addresses, etc.
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
>>> [fileid[:4] for fileid in inaugural.fileids()]
33. cont..
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()
Conditional frequency distribution of the words 'america' and 'citizen' across the years
34. cont..
Generating text (applying bigrams):
>>> def generate_model(cfdist, word, num=15):
...     for i in range(num):
...         print word,
...         word = cfdist[word].max()
>>> text = nltk.corpus.genesis.words('english-kjv.txt')
>>> bigrams = nltk.bigrams(text)
>>> cfd = nltk.ConditionalFreqDist(bigrams)
>>> print cfd['living']
>>> generate_model(cfd, 'living')
35. cont..
WordList corpora
>>> def unusual_words(text):
...     text_vocab = set(w.lower() for w in text if w.isalpha())
...     english_vocab = set(w.lower() for w in nltk.corpus.words.words())
...     unusual = text_vocab.difference(english_vocab)
...     return sorted(unusual)
>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
These are the unusual or misspelled words.
36. cont..
Stop words:
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
Percentage of content:
>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
>>> content_fraction(nltk.corpus.reuters.words())
That is the fraction of useful words!
37. cont..
WordNet: a structured, semantically oriented dictionary
>>> from nltk.corpus import wordnet as wn
Get the synsets (collections of synonyms):
>>> wn.synsets('motorcar')
Now get the synonyms (lemmas):
>>> wn.synset('car.n.01').lemma_names
38. cont..
What does car.n.01 actually mean?
>>> wn.synset('car.n.01').definition
An example:
>>> wn.synset('car.n.01').examples
Also:
>>> wn.lemma('supply.n.02.supply').antonyms()
Useful data in many applications like WSD, understanding text, question answering, etc.
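WordNet also encodes the is-a hierarchy, which many similarity measures build on (a small sketch using the NLTK WordNet interface):
>>> wn.synset('car.n.01').hypernyms()
>>> wn.synset('car.n.01').path_similarity(wn.synset('truck.n.01'))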
39. cont..
Text from the web:
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
>>> len(raw)
The length is in characters...!
40. cont..
Tokenization:
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
>>> len(tokens)
>>> tokens[:15]
Convert to a Text:
>>> text = nltk.Text(tokens)
>>> type(text)
Now you can do all the above things with this!
41. cont..
What about HTML?
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
It's HTML; clean out the tags...
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens[:60]
>>> text = nltk.Text(tokens)
42. cont..
Try it yourself:
Process RSS feeds using the Universal Feed Parser (http://feedparser.org/)
Opening text from the local machine:
>>> f = open('document.txt')
>>> raw = f.read()
(make sure you are in the right directory)
Then tokenize, bring it into a Text, and play with it
43. cont..
Capture user input:
>>> s = raw_input("Enter some text: ")
>>> print "You typed", len(nltk.word_tokenize(s)), "words."
Writing output to files:
>>> output_file = open('output.txt', 'w')
>>> words = set(nltk.corpus.genesis.words('english-kjv.txt'))
>>> for word in sorted(words):
...     output_file.write(word + "\n")
44. cont..
Stemming: NLTK has inbuilt stemmers
>>> text = ['apples', 'called', 'indians', 'applying']
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(p) for p in text]
>>> [lancaster.stem(p) for p in text]
Lemmatization: the WordNet lemmatizer removes affixes only if the resulting word exists in the dictionary
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in text]
45. cont..
Using a POS tagger:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
Query the POS tag documentation as:
>>> nltk.help.upenn_tagset('NNP')
Parsing with NLTK: an example
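The parsing example on the original slide was an image; as a stand-in, here is a minimal sketch with a hand-written toy grammar for the earlier 'Mary saw a dog' sentence (the grammar is an assumption for illustration; parse_cfg and nbest_parse are the NLTK 2.x-era calls):
>>> grammar = nltk.parse_cfg('''
... S -> NP VP
... VP -> V NP
... NP -> Det N | 'Mary'
... Det -> 'a'
... N -> 'dog'
... V -> 'saw'
... ''')
>>> parser = nltk.ChartParser(grammar)
>>> for tree in parser.nbest_parse('Mary saw a dog'.split()):
...     print tree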
46. cont..
Classifying with NLTK:
Problem: gender identification from a name
Feature: the last letter
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
Get the data:
>>> from nltk.corpus import names
>>> import random
47. cont..
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...     [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
Get the feature sets:
>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
Train:
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
Test:
>>> classifier.classify(gender_features('Neo'))
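The slides stop at a single prediction; the held-out set can be scored with NLTK's built-in accuracy helper (this follows the NLTK book rather than the slides):
>>> print nltk.classify.accuracy(classifier, test_set)
>>> classifier.show_most_informative_features(5)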
48. Conclusion
NLTK is a rich resource for starting out with Natural Language Processing, and its being open source and based on Python adds feathers to its cap!
49. References
➢ http://www.nltk.org/
➢ http://www.python.org/
➢ http://wikipedia.org/
(all last accessed on 19th March 2010)
➢ Natural Language Processing with Python, Steven Bird, Ewan Klein, and Edward Loper, O'Reilly
50. Thank you
# More on NLP @ http://matra2.blogspot.com/
# Slides were prepared for 'supporting' the talk and are distributed as-is under CC 3.0
(CC) KBCS CDAC MUMBAI