Text Analytics With
NLTK
Girish Khanzode
Contents
• Tokenization
• Corpuses
• Frequency Distribution
• Stylistics
• Sentence Tokenization
• WordNet
• Stemming
• Lemmatization
• Part of Speech Tagging
• Tagging Methods
• Unigram Tagging
• N-gram Tagging
• Chunking – Shallow Parsing
• Entity Recognition
• Supervised Classification
• Document Classification
• Hidden Markov Models - HMM
• References
NLTK
• A set of Python modules to carry out many common natural language
tasks.
• Basic classes to represent data for NLP
• Infrastructure to build NLP programs in Python
• Python interface to over 50 corpora and lexical resources
• Focus on Machine Learning with specific domain knowledge
• Free and Open Source
NLTK
• NumPy and SciPy under the hood
• Fast and Formal
• Standard interfaces for tokenization, part-of-speech tagging, syntactic parsing
and text classification
• Windows:
>>> import nltk
>>> nltk.download('all')
• Linux
$ pip install --upgrade nltk
NLTK - Top-Level Organization
• Organized as a flat hierarchy of packages and modules
• Each module provides the tools necessary to address a specific task
• Modules have two types of classes
– Data-oriented classes
• Used to represent information relevant to natural language processing.
– Task-oriented classes
• Encapsulate the resources and methods needed to perform a specific task.
Modules
• Token - classes for representing and processing individual elements of
text, such as words and sentences
• Probability - classes for representing and processing probabilistic
information
• Tree - classes for representing and processing hierarchical information
over text
• Cfg - classes for representing and processing context free grammars
Modules
• Tagger - tagging each word with a part-of-speech, a sense, etc
• Parser - building trees over text (includes chart, chunk and probabilistic
parsers)
• Classifier - classify text into categories (includes feature,
featureSelection, maxent, naivebayes)
• Draw - visualize NLP structures and processes
• Corpus - access (tagged) corpus data
Tokenization
• Simplest way to represent a text is with a single string
• Difficult to process text in this format
• Convenient to work with a list of tokens
• Task of converting a text from a single string to a list of tokens is known as
tokenization
• The most basic natural language processing technique
• Example - Word Tokenization
Input : “Hey there, How are you all?”
Output : “Hey”, “there,”, “How”, “are”, “you”, “all?”
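A quick sketch of the same example with NLTK's own tokenizer, which also splits off punctuation into separate tokens (assumes the 'punkt' tokenizer models have been downloaded via nltk.download()):
• >>> import nltk
• >>> nltk.word_tokenize("Hey there, How are you all?")
• ['Hey', 'there', ',', 'How', 'are', 'you', 'all', '?']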
Tokens and Types
• The term word can be used in two different ways
– To refer to an individual occurrence of a word
– To refer to an abstract vocabulary item
• For example, the sentence “my dog likes his dog” contains five occurrences of
words, but four vocabulary items
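The distinction is easy to verify in plain Python; a small illustrative sketch:
• >>> words = "my dog likes his dog".split()
• >>> len(words) # word tokens
• 5
• >>> len(set(words)) # word types
• 4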
Tokens and Types
• To avoid confusion use more precise terminology
– Word token - an occurrence of a word
– WordType - a vocabulary item
• Tokens constructed from their types using the Token constructor
• Token member functions - type and loc
Tokens and Types
>>> from nltk.token import *
>>> my_word_type = 'dog'
'dog'
>>> my_word_token = Token(my_word_type)
'dog'@[?]
Text Locations
• Text location @ [s:e] specifies a region of a text
– s is the start index
– e is the end index
• Specifies the text beginning at s, and including everything up to (but not
including) the text at e
• Consistent with Python slice
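The same start-inclusive, end-exclusive convention is what ordinary Python slicing uses, as this illustrative sketch shows:
• >>> tokens = ['I', 'saw', 'a', 'man']
• >>> tokens[1:3] # region [1:3) - includes index 1, excludes index 3
• ['saw', 'a']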
Text Locations
• Think of indices as appearing between elements
– I saw a man
– 0 1 2 3 4
• Shorthand notation when location width = 1
• Indices based on different units
– character
– word
– sentence
Text Locations
• Locations tagged with sources
– files, other text locations (e.g. the first word of the first sentence in the file)
• Location member functions
– start
– end
– unit
– source
Text Corpus
• Large collection of text
• Concentrate on a topic or open domain
• May be raw text or annotated / categorized
Corpuses
• Gutenberg - selection of e-books from Project Gutenberg
• Webtext - forum discussions, reviews, movie script
• nps_chat - anonymized chats
• Brown - 1 million word corpus, categorized by genre
• Reuters - news corpus
• Inaugural - inaugural addresses of presidents
• Udhr - multilingual corpus
Accessing Corpora
• Corpora on disk - text files
• NLTK provides Python modules / functions / classes that allow for
accessing the corpora in a convenient way
• It is quite an effort to write functions that read in a corpus, especially when
it comes with annotations
• The task of reading in a corpus is needed in many NLP projects
Accessing Corpora
• # tell Python we want to use the Gutenberg corpus
• from nltk.corpus import gutenberg
• # which files are in this corpus?
• print(gutenberg.fileids())
• >>> ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-
kjv.txt', ...]
Accessing Corpora - RawText
• # get the raw text of a corpus = one string
• >>> emmaText = gutenberg.raw("austen-emma.txt")
• # print the first 289 characters of the text
• >>> emmaText[:289]
• '[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever,
and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best
blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or
vex her.'
Accessing Corpora -Words
• # get the words of a corpus as a list
• emmaWords = gutenberg.words("austen-emma.txt")
• # print the first 30 words of the text
• >>> print(emmaWords[:30])
• ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma',
'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home',
'and', 'happy', 'disposition', ',', 'seemed']
Accessing Corpora: Sentences
• # get the sentences of a corpus as a list of lists - one list of words per sentence
• >>> senseSents = gutenberg.sents("austen-sense.txt")
• # print out the first four sentences
• >>> print(senseSents[:4])
• [['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']'], ['CHAPTER', '1'], ['The',
'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.'], ['Their', 'estate',
'was', 'large', ',', 'and', 'their', 'residence', 'was', 'at', ...]]
Counting
• Use Inaugural Address text.
• >>> from nltk.book import text4
• Counting vocabulary: the length of a
text from start to finish
• >>> len(text4)
• 145735
• How many distinct words?
• >>> len(set(text4)) #types
• 9754
• Richness of the text.
• >>> len(text4) / len(set(text4))
• 14.941049825712529
• >>> 100 * text4.count('democracy') /
len(text4)
• 0.03568120218204275
Positions of a Word in Text
Lexical Dispersion Plot
List Elements Operations
• List comprehension
– >>> len(set([word.lower() for word in
text4 if len(word)>5]))
– 7339
– >>> [w.upper() for w in text4[0:5]]
– ['FELLOW', '-', 'CITIZENS', 'OF', 'THE']
• Loops and conditionals
• for word in text4[0:5]:
    if len(word) < 5 and word.endswith('e'):
        print word, 'is short and ends with e'
    elif word.istitle():
        print word, 'is a titlecase word'
    else:
        print word, 'is just another word'
Brown Corpus
• First million-word electronic corpus of English
• Created at Brown University in 1961
• Text from 500 sources, categorized by genre
• >>> from nltk.corpus import brown
• >>> print(brown.categories())
• ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor‘,
'learned', 'lore', 'mystery', 'news', 'religion‘, 'reviews', 'romance', 'science_fiction']
Brown Corpus – Retrieve Words by Category
• >>> from nltk.corpus import brown
• >>> news_words = brown.words(categories = "news")
• >>> print(news_words)
• ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation',
'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', ...]
Brown Corpus – Retrieve Words by Category
• >>> adv_words = brown.words(categories = "adventure")
• >>> print(adv_words)
• ['Dan', 'Morgan', 'told', 'himself', 'he', 'would', 'forget', 'Ann', 'Turner', '.', ...]
• >>> reli_words = brown.words(categories = "religion")
• >>> print(reli_words)
• ['As', 'a', 'result', ',', 'although', 'we', 'still', 'make', 'use', 'of', 'this', 'distinction', ',',...]
Frequency Distribution
• Records how often each item occurs in a list of words
• Frequency distribution over words
• Basically a dictionary with some extra functionality
• The constructor creates a frequency distribution from a list of words
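A minimal sketch of building one directly from a word list (most_common is available in current NLTK releases; the list is illustrative):
• >>> from nltk import FreqDist
• >>> fdist = FreqDist(['the', 'dog', 'saw', 'the', 'cat'])
• >>> fdist['the']
• 2
• >>> fdist.most_common(2)
• [('the', 2), ('dog', 1)]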
Frequency Distribution
• >>>news_words = brown.words(categories = "news")
• >>>fdist = nltk.FreqDist(news_words)
• >>>print("shoe:", fdist["shoe"])
• >>>print("the: ", fdist["the"])
Frequency Distribution
• # show the 10 most frequent words & frequencies
• >>>fdist.tabulate(10)
• the , . of and to a in for The
• 5580 5188 4030 2849 2146 2116 1993 1893 943 806
Plot Frequency Distribution
• Create a plot of the 10 most frequent words
• >>>fdist.plot(10)
Stylistics
• Systematic differences between genres
• Brown corpus with its categories is a convenient resource
• Is there a difference in how the modal verbs (can, could, may, might,
must, will) are used in the genres?
• Let us look at the frequency distribution
Stylistics
• from nltk import FreqDist
• # Define modals of interest
• >>>modals = ["may", "could", "will"]
• # Define genres of interest
• >>>genres = ["adventure", "news", "government", "romance"]
• # count how often they occur in the genres of interest
• >>>for g in genres:
...     words = brown.words(categories = g)
...     fdist = FreqDist([w.lower() for w in words if w.lower() in modals])
...     print g, fdist
Conditional Frequency Distributions
• >>>from nltk import ConditionalFreqDist
• >>>cfdist = ConditionalFreqDist()
• >>>for g in genres:
...     words = brown.words(categories = g)
...     for w in words:
...         if w.lower() in modals:
...             cfdist[g].inc(w.lower())
• >>> cfdist.tabulate()
             could  may  will
  Adventure    154    7    51
 Government     38  179   244
       News     87   93   389
    Romance    195   11    49
• >>>cfdist.plot(title="Modals in various Genres")
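Note that inc() is the older NLTK 2 API. A sketch of the same computation for NLTK 3, where a ConditionalFreqDist can be built directly from (condition, sample) pairs:
• >>> from nltk.corpus import brown
• >>> from nltk import ConditionalFreqDist
• >>> modals = ["could", "may", "will"]
• >>> genres = ["adventure", "news", "government", "romance"]
• >>> cfd = ConditionalFreqDist((g, w.lower())
...                            for g in genres
...                            for w in brown.words(categories=g)
...                            if w.lower() in modals)
• >>> cfd.tabulate()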
Conditional Frequency Distributions
Processing RawText
• Assume you have a text file on your disk...
• # Read the text
• >>> path = "holmes.txt"
• >>> f = open(path)
• >>> rawText = f.read()
• >>> f.close()
• >>> print(rawText[:165])
• THE ADVENTURES OF SHERLOCK HOLMES
• By
• SIR ARTHUR CONAN DOYLE
I. A Scandal in Bohemia
II. The Red-headed League
SentenceTokenization
• # Split the text up into sentences
• >>> sents = nltk.sent_tokenize(rawText)
• >>> print(sents[20:22])
• ['I had seen little of Holmes lately.', 'My marriage had drifted us\r\naway from
each other.', ...]
WordTokenization
• >>># Tokenize the sentences using nltk
• >>>tokens = []
• >>>for sent in sents:
tokens += nltk.word_tokenize(sent)
• >>>print(tokens[300:350])
• ['such', 'as', 'his', '.', 'And', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', ',', 'and', 'that',
'woman', 'was', 'the', 'late', 'Irene', 'Adler', ',', 'of', 'dubious', 'and', 'questionable', 'memory',
...]
Creating a Text Object
• Using a list of tokens, we can create an nltk.Text object for a document.
• Collocations = terms that occur together unusually often
• Concordance view = shows the contexts in which a token occurs
Creating a Text Object
• >>># Create a text object
• >>>text = nltk.Text(tokens)
• >>># Do stuff with the text object
• >>>print(text.collocations())
• Sherlock Holmes; said Holmes; St. Simon; Baker Street; Lord St.; St. Clair; Mr.
Holmes; Hosmer Angel; Irene Adler; Miss Hunter; young lady; Briony Lodge; Stoke
Moran; Neville St.; Miss Stoner; Scotland Yard; could see; Mr. Holmes.; Boscombe
Pool; Mr. Rucastle
Concordance View
• >>>print(text.concordance("Irene"))
• >>>Building index...
• >>>Displaying 17 of 17 matches:
• to love for Irene Adler . All emotions , and that one
• was the late Irene Adler , of dubious and questionable
• dventuress , Irene Adler . The name is no doubt familia
• nd . " " And Irene Adler ? " " Threatens to send them t
• se , of Miss Irene Adler . " " Quite so ; but the seque
• And what of Irene Adler ? " I asked . " Oh , she has t
• tying up of Irene Adler , spinster , to Godfrey Norton
• ction . Miss Irene , or Madame , rather , returns from
• ...
• ...
Annotated Corpora
• Example - The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd
Friday/nr an/at investigation/nn ...
• Some corpora come with annotations - POS tags, parse trees,...
• NLTK provides convenient access to these corpora (get the text + annotations)
• Dependency Tree Bank (e.g. Penn): collection of (dependency-) parsed sentences
(manually annotated), can be used for training a statistical parser or parser
evaluation
WordNet
• Structured, semantically oriented English dictionary
• Synonyms, antonyms, hyponyms, hypernyms, depth of a synset, trees, entailments,
etc.
• >>> from nltk.corpus import wordnet as wn
• >>> wn.synsets('motorcar')
• [Synset('car.n.01')]
• >>> wn.synset('car.n.01').lemma_names
• ['car', 'auto', 'automobile', 'machine', 'motorcar']
WordNet
• >>> wn.synset('car.n.01').definition
• 'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
• >>> for synset in wn.synsets('car')[1:3]:
• ... print synset.lemma_names
• ['car', 'railcar', 'railway_car', 'railroad_car'] ['car', 'gondola']
• >>> wn.synset('walk.v.01').entailments()
• #Walking involves stepping
• [Synset('step.v.01')]
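In NLTK 3, lemma_names and definition became methods (lemma_names(), definition()). A short sketch of walking up the hypernym hierarchy:
• >>> car = wn.synset('car.n.01')
• >>> car.hypernyms()
• [Synset('motor_vehicle.n.01')]
• >>> car.hypernym_paths()[0][0] # the root of one path up the tree
• Synset('entity.n.01')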
Getting Input Text - HTML
• >>> from urllib import urlopen
• >>> url = "http://www.bbc.co.uk/news/science-environment-21471908"
• >>> html = urlopen(url).read()
• >>> html[:60]
• '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http'
• >>> raw = nltk.clean_html(html)
• >>> tokens = nltk.word_tokenize(raw)
• >>> tokens[:15]
• ['BBC', 'News', '-', 'Exoplanet', 'Kepler', '37b', 'is', 'tiniest‘, 'yet', '-', 'smaller', 'than', 'Mercury', 'Accessibility',
'links‘]
Getting Input Text - User
• >>> s = raw_input("Enter some text: ")
• Use your own files on disk
• >>> f = open('C:/Data/Files/UK_natl_2010_en_Lab.txt')
• >>> raw = f.read()
• >>> print raw[:100]
• #Foreword by Gordon Brown
• This General Election is fought as our troops are bravely fighting to def
Import Files as Corpus
• >>> from nltk.corpus import PlaintextCorpusReader
• >>> corpus_root = "C:/Data/Files/"
• >>> wordlists = PlaintextCorpusReader(corpus_root, '.*\.txt')
• >>> wordlists.fileids()[:3]
• ['UK_natl_1987_en_Con.txt', 'UK_natl_1987_en_Lab.txt',
• 'UK_natl_1987_en_LibSDP.txt']
• >>> wordlists.words('UK_natl_2010_en_Lab.txt')
• ['#', 'Foreword', 'by', 'Gordon', 'Brown', '.', 'This', ...]
Stemming
• Strip off affixes
• >>>porter = nltk.PorterStemmer()
• >>>[porter.stem(t) for t in tokens]
• Porter stemmer: lying → lie, women → women
• >>>lancaster = nltk.LancasterStemmer()
• >>>[lancaster.stem(t) for t in tokens]
• Lancaster stemmer: lying → lying, women → wom
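A concrete sketch of the difference on the two words mentioned above:
• >>> import nltk
• >>> porter = nltk.PorterStemmer()
• >>> porter.stem('lying')
• 'lie'
• >>> lancaster = nltk.LancasterStemmer()
• >>> lancaster.stem('women')
• 'wom'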
Lemmatization
• Removes affixes only if the resulting word is in its dictionary
• >>>wnl = nltk.WordNetLemmatizer()
• >>>[wnl.lemmatize(t) for t in tokens]
• lying → lying, women → woman
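The lemmatizer treats words as nouns by default; passing a part of speech changes the result, as this sketch shows:
• >>> wnl.lemmatize('lying') # noun by default
• 'lying'
• >>> wnl.lemmatize('lying', pos='v') # as a verb
• 'lie'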
Write Output to File
• Save separated sentences text to a new file
• >>>output_file = open('C:/Data/Files/output.txt', 'w')
• >>>words = set(sents)
• >>>for word in sorted(words):
• >>> output_file.write(word + "\n")
• To write non-text data, first convert it to string - str()
• Avoid filenames that contain space characters or that are identical except for
case distinctions
Part of Speech Tagging
• POS Tagging - Process of classifying words into their parts of speech &
labelling them accordingly
– Words grouped into classes, such as nouns, verbs, adjectives, and adverbs
• Parts of speech are also known as word classes or lexical categories
• The collection of tags used for a particular task is known as a tagset
Part of Speech Tagging
• NLTK tags text automatically
– Predicting the behaviour of previously unseen words
– Analyzing word usage in corpora
– Text-to-speech systems
– Powerful searches
– Classification
Tagging Methods
• Default tagger
• Regular expression tagger
• Unigram tagger
• N-gram taggers
Tagging Methods
• Can be combined using a technique known as backoff
– when a more specialized model (such as a bigram tagger) cannot assign a tag
in a given context, we backoff to a more general model (such as a unigram
tagger)
• Taggers can be trained and evaluated using tagged corpora
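A minimal sketch of chaining taggers with backoff, trained on the Brown news category (the split point is illustrative):
• >>> from nltk.corpus import brown
• >>> from nltk import DefaultTagger, UnigramTagger, BigramTagger
• >>> train = brown.tagged_sents(categories='news')[:4000]
• >>> t0 = DefaultTagger('NN') # last resort: guess noun
• >>> t1 = UnigramTagger(train, backoff=t0)
• >>> t2 = BigramTagger(train, backoff=t1)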
Tagging Examples
• Some corpora are already tagged
• >>> nltk.corpus.brown.tagged_words()
• [('The', 'AT'), ('Fulton', 'NP-TL'), ...]
• A simple example
• >>> text = nltk.word_tokenize("And now for something completely different")
• >>> nltk.pos_tag(text)
• [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
– CC is coordinating conjunction; RB is adverb; IN is preposition; NN is noun; JJ is adjective
– Lots of others - foreign term, verb tenses, “wh” determiner etc
Tagging Examples
• An example with homonyms
• >>> text = nltk.word_tokenize("They refuse to permit us to obtain the
refuse permit")
• >>> nltk.pos_tag(text)
• [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
UnigramTagging
• Unigram tagging - nltk.UnigramTagger()
– Assign the tag that is most likely for that particular token
– Train it specifying tagged sentence data as a parameter when we initialize the
tagger
– Separate training and testing data
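A sketch of training and evaluating a unigram tagger on separate splits (the split point is illustrative; the newest NLTK releases rename evaluate() to accuracy()):
• >>> from nltk.corpus import brown
• >>> from nltk import UnigramTagger
• >>> tagged = brown.tagged_sents(categories='news')
• >>> train, test = tagged[:4000], tagged[4000:]
• >>> tagger = UnigramTagger(train)
• >>> tagger.evaluate(test) # accuracy on held-out sentences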
N-gramTagging
• Context is the current word together with the part-of-speech tags of the n-1
preceding tokens
• Evaluate performance
• Contexts that were not present in the training data – accuracy vs. coverage
• Combine taggers
Information Extraction
• Search large bodies of unrestricted
text for specific types of entities and
relations
• Move these into well-organized
databases
• Use these databases to find answers
for specific questions
Information Extraction - Steps
• Segmenting, tokenizing, and part-of-speech tagging the text
• Search resulting data for specific types of entity
• Examine entities that are mentioned near one another in the text to
determine if specific relationships hold between those entities
Chunking – Shallow Parsing
• Analyzes a sentence to identify its constituents: noun groups, verbs, verb groups etc.
• However, it does not specify their internal structure, nor their role in the main sentence
• The smaller boxes show word-level tokenization and part-of-speech tagging, while large
boxes show higher-level chunking
• Each of these larger boxes is called a chunk
• Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens
• Like tokenization, the pieces produced by a chunker do not overlap in the source text
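A minimal noun-phrase chunking sketch with NLTK's RegexpParser (the grammar and sentence are illustrative):
• >>> import nltk
• >>> grammar = "NP: {<DT>?<JJ>*<NN>}" # optional determiner, adjectives, then a noun
• >>> cp = nltk.RegexpParser(grammar)
• >>> sent = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]
• >>> print(cp.parse(sent))
• (S (NP the/DT little/JJ dog/NN) barked/VBD)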
Chunking – Shallow Parsing
Entity Recognition
• Entity recognition performed using chunkers
– Segment multi-token sequences and label them with the appropriate entity type
– ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political
entity)
• Constructing chunkers
– Use rule-based systems like the RegexpParser class from NLTK
– Use machine learning techniques like ConsecutiveNPChunker
– POS tags are very important in this context.
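A sketch using NLTK's built-in named-entity chunker (assumes the 'maxent_ne_chunker' and 'words' resources are downloaded; the sentence is illustrative):
• >>> import nltk
• >>> sent = "Mark works at Google in London."
• >>> tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
• >>> print(tree) # spans labelled PERSON, ORGANIZATION, GPE, ...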
Relation Extraction
• Rule-based systems - look for specific patterns in the text that connect
entities and the intervening words
• Machine-learning systems - attempt to learn patterns automatically from
a training corpus
ProcessingText
• Choose a particular class label for a given input
• Identify particular features of language data that are salient for classifying it
• Construct models of language that can be used to perform language processing
tasks automatically
• Learn about text/language from these models
• Machine learning techniques
– Decision trees
– Naive Bayes classifiers
– Maximum entropy classifiers
Applications
• Determining the topic of an article or a book
• Deciding if an email is spam or not
• Determining who wrote a text
• Determining the meaning of a word in a particular context
• Open-class classification - set of labels is not defined in advance
• Multi-class classification - each instance may be assigned multiple labels
• Sequence classification - a list of inputs are jointly classified
Supervised Classification
Example – Identify Gender by Name
• Relevant feature: last letter
• Create a feature set (a dictionary) that maps feature names to their values
– >>>def gender_features(word):
– ...    return {'last_letter': word[-1]}
• Import names, shuffle them
– >>>from nltk.corpus import names
– >>>import random
– >>>names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for
name in names.words('female.txt')])
– >>>random.shuffle(names)
Example – Identify Gender by Name
• Divide list of features into training set and test set
– >>>featuresets = [(gender_features(n), g) for (n,g) in names]
– >>>from nltk.classify import apply_features
– >>># Use apply_features if you're working with large corpora
– >>>train_set = apply_features(gender_features, names[500:])
– >>>test_set = apply_features(gender_features, names[:500])
• Use training set to train a naive Bayes classifier
– >>>classifier = nltk.NaiveBayesClassifier.train(train_set)
Example – Identify Gender by Name
• Test the classifier on unseen data
– >>> classifier.classify(gender_features('Neo'))
– >>>'male'
– >>> classifier.classify(gender_features('Trinity'))
– >>>'female'
• >>> print nltk.classify.accuracy(classifier, test_set)
– >>>0.744
Example – Identify Gender by Name
• Examine the classifier to see which feature is most effective at distinguishing
between classes
• >>> classifier.show_most_informative_features(5)
• Most Informative Features
• last_letter = 'a' female : male = 35.7 : 1.0
• last_letter = 'k' male : female = 31.7 : 1.0
• last_letter = 'f' male : female = 16.6 : 1.0
• last_letter = 'p' male : female = 11.9 : 1.0
• last_letter = 'v' male : female = 10.5 : 1.0
Example - Document Classification
• Use corpora where documents have been labelled with categories
– Build classifiers that will automatically tag new documents with appropriate
category labels
• Use the movie review corpus, which categorizes reviews as positive or
negative to construct a list of documents
• Define a feature extractor for documents - feature for each of the most
frequent 2000 words in the corpus
• Define a feature extractor that checks if words are present in a document
• Train a classifier to label new movie reviews - a condensed sketch follows
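A condensed sketch of that pipeline (assumes the movie_reviews corpus is downloaded; the 2000-word cutoff and 100-document test split follow the description above):
• >>> import random
• >>> from nltk.corpus import movie_reviews
• >>> from nltk import FreqDist, NaiveBayesClassifier
• >>> documents = [(list(movie_reviews.words(fid)), cat)
...                for cat in movie_reviews.categories()
...                for fid in movie_reviews.fileids(cat)]
• >>> random.shuffle(documents)
• >>> all_words = FreqDist(w.lower() for w in movie_reviews.words())
• >>> word_features = [w for w, _ in all_words.most_common(2000)]
• >>> def document_features(doc):
...         words = set(doc)
...         return {'contains(%s)' % w: (w in words) for w in word_features}
• >>> featuresets = [(document_features(d), c) for (d, c) in documents]
• >>> train_set, test_set = featuresets[100:], featuresets[:100]
• >>> classifier = NaiveBayesClassifier.train(train_set)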
Document Classification
• Compute accuracy on the test set
– >>> print nltk.classify.accuracy(classifier, test_set)
– >>> 0.79
• Evaluation issues: the size of the test set depends on the number of labels, their balance and the diversity of the test set.
• Show most informative features
• >>> classifier.show_most_informative_features(5)
– Most Informative Features
– contains(outstanding) = True pos : neg = 11.2 : 1.0
– contains(mulan) = True pos : neg = 8.9 : 1.0
– contains(wonderfully) = True pos : neg = 8.5 : 1.0
– contains(seagal) = True neg : pos = 8.3 : 1.0
– contains(damon) = True pos : neg = 6.0 : 1.0
Context
• Contextual features often provide powerful clues for
classification
• Context-dependent feature extractor - pass in a complete
(untagged) sentence, along with the index of the target word
• Joint classifier models - choose an appropriate labelling for a
collection of related inputs
Sequence Classification
• Jointly choose part-of-speech tags for all the words in a given
sentence
• Consecutive classification - find the most likely class label for
the first input, then to use that answer to help find the best
label for the next input, repeat
• Feature extraction function needs to take a history argument
- list of tags predicted so far
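A sketch of such a history-aware feature extractor (the feature names are illustrative):
• >>> def pos_features(sentence, i, history):
...         return {'suffix(1)': sentence[i][-1:],
...                 'prev-word': '<START>' if i == 0 else sentence[i-1],
...                 'prev-tag': '<START>' if i == 0 else history[i-1]}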
Hidden Markov Models - HMM
• Use inputs and the history of predicted tags
• Generate a probability distribution over tags
• Combine probabilities to calculate scores for sequences
• Choose tag sequence with the highest probability
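NLTK ships a supervised HMM trainer; a minimal sketch on Brown news data (the split is illustrative):
• >>> from nltk.corpus import brown
• >>> from nltk.tag import hmm
• >>> trainer = hmm.HiddenMarkovModelTrainer()
• >>> tagger = trainer.train_supervised(brown.tagged_sents(categories='news')[:3000])
• >>> tagger.tag("The jury said Friday".split())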
More Advanced Models
• Maximum Entropy Markov Models
• Linear-Chain Conditional Random Field Models
References
1. Indurkhya, Nitin and Fred Damerau (eds, 2010). Handbook of Natural Language Processing (Second
Edition). Chapman & Hall/CRC.
2. Jurafsky, Daniel and James Martin (2008). Speech and Language Processing (Second Edition). Prentice
Hall.
3. Mitkov, Ruslan (ed, 2003). The Oxford Handbook of Computational Linguistics. Oxford University Press
(second edition expected in 2010).
4. Bird, Steven; Klein, Ewan; Loper, Edward (2009). Natural Language Processing with Python. O'Reilly
Media Inc.
5. Perkins, Jacob (2010). Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing.
6. Bird, Steven; Klein, Ewan; Loper, Edward; Baldridge, Jason (2008). Proceedings of the Third Workshop
on Issues in Teaching Computational Linguistics, ACL.
Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode

More Related Content

What's hot

Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Pythonshanbady
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measuresankit_ppt
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processingrohitnayak
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionRrubaa Panchendrarajan
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.netwww.myassignmenthelp.net
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity RecognitionTomer Lieber
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Natural language processing
Natural language processingNatural language processing
Natural language processingBasha Chand
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)Kuppusamy P
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingCloudxLab
 
Recursion tree method
Recursion tree methodRecursion tree method
Recursion tree methodRajendran
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 

What's hot (20)

Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
NLP
NLPNLP
NLP
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.net
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Text MIning
Text MIningText MIning
Text MIning
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Recursion tree method
Recursion tree methodRecursion tree method
Recursion tree method
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 

Viewers also liked

Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language ProcessingJaganadh Gopinadhan
 
Knowledge extraction from the Encyclopedia of Life using Python NLTK
Knowledge extraction from the Encyclopedia of Life using Python NLTKKnowledge extraction from the Encyclopedia of Life using Python NLTK
Knowledge extraction from the Encyclopedia of Life using Python NLTKAnne Thessen
 
OUTDATED Text Mining 2/5: Language Modeling
OUTDATED Text Mining 2/5: Language ModelingOUTDATED Text Mining 2/5: Language Modeling
OUTDATED Text Mining 2/5: Language ModelingFlorian Leitner
 
Statistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnStatistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnOlivier Grisel
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKJacob Perkins
 
Trend detection and analysis on Twitter
Trend detection and analysis on TwitterTrend detection and analysis on Twitter
Trend detection and analysis on TwitterLukas Masuch
 
Sentiment analysis-by-nltk
Sentiment analysis-by-nltkSentiment analysis-by-nltk
Sentiment analysis-by-nltkWei-Ting Kuo
 
GPU Accelerated Natural Language Processing by Guillermo Molini
GPU Accelerated Natural Language Processing by Guillermo MoliniGPU Accelerated Natural Language Processing by Guillermo Molini
GPU Accelerated Natural Language Processing by Guillermo MoliniBig Data Spain
 
Natural language processing
Natural language processingNatural language processing
Natural language processingprashantdahake
 
Natural language processing
Natural language processingNatural language processing
Natural language processingYogendra Tamang
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKOlivier Grisel
 

Viewers also liked (16)

Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Knowledge extraction from the Encyclopedia of Life using Python NLTK
Knowledge extraction from the Encyclopedia of Life using Python NLTKKnowledge extraction from the Encyclopedia of Life using Python NLTK
Knowledge extraction from the Encyclopedia of Life using Python NLTK
 
NLTK Book Chapter 2
NLTK Book Chapter 2NLTK Book Chapter 2
NLTK Book Chapter 2
 
OUTDATED Text Mining 2/5: Language Modeling
OUTDATED Text Mining 2/5: Language ModelingOUTDATED Text Mining 2/5: Language Modeling
OUTDATED Text Mining 2/5: Language Modeling
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
NoSql
NoSqlNoSql
NoSql
 
Statistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnStatistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learn
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTK
 
Trend detection and analysis on Twitter
Trend detection and analysis on TwitterTrend detection and analysis on Twitter
Trend detection and analysis on Twitter
 
Sentiment analysis-by-nltk
Sentiment analysis-by-nltkSentiment analysis-by-nltk
Sentiment analysis-by-nltk
 
GPU Accelerated Natural Language Processing by Guillermo Molini
GPU Accelerated Natural Language Processing by Guillermo MoliniGPU Accelerated Natural Language Processing by Guillermo Molini
GPU Accelerated Natural Language Processing by Guillermo Molini
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
NLP
NLPNLP
NLP
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
 

Similar to NLTK

What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...gagravarr
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...gagravarr
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Fasihul Kabir
 
Assignment4.pptx
Assignment4.pptxAssignment4.pptx
Assignment4.pptxjatinchand3
 
Intro 2 document
Intro 2 documentIntro 2 document
Intro 2 documentUma Kant
 
Natural_Language_Processing_1.ppt
Natural_Language_Processing_1.pptNatural_Language_Processing_1.ppt
Natural_Language_Processing_1.ppttestbest6
 
Lazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerLazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerSho Fola Soboyejo
 
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxNLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxrohithprabhas1
 
Semantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrSemantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrTrey Grainger
 
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...Lucidworks
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text miningLokesh Ramaswamy
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text miningLokesh Ramaswamy
 
UVA MDST 3073 Texts and Models-2012-09-11
UVA MDST 3073 Texts and Models-2012-09-11UVA MDST 3073 Texts and Models-2012-09-11
UVA MDST 3073 Texts and Models-2012-09-11Rafael Alvarado
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Josh Patterson
 

Similar to NLTK (20)

What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 
Assignment4.pptx
Assignment4.pptxAssignment4.pptx
Assignment4.pptx
 
Intro 2 document
Intro 2 documentIntro 2 document
Intro 2 document
 
Intro
IntroIntro
Intro
 
Intro
IntroIntro
Intro
 
Natural_Language_Processing_1.ppt
Natural_Language_Processing_1.pptNatural_Language_Processing_1.ppt
Natural_Language_Processing_1.ppt
 
Lazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerLazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text Summarizer
 
BT02.pptx
BT02.pptxBT02.pptx
BT02.pptx
 
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxNLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
 
Taming Text
Taming TextTaming Text
Taming Text
 
Semantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrSemantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/Solr
 
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
 
Python assignment help
Python assignment helpPython assignment help
Python assignment help
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text mining
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text mining
 
UVA MDST 3073 Texts and Models-2012-09-11
UVA MDST 3073 Texts and Models-2012-09-11UVA MDST 3073 Texts and Models-2012-09-11
UVA MDST 3073 Texts and Models-2012-09-11
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 

More from Girish Khanzode (9)

Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
IR
IRIR
IR
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Hadoop
HadoopHadoop
Hadoop
 
Language R
Language RLanguage R
Language R
 
Funtional Programming
Funtional ProgrammingFuntional Programming
Funtional Programming
 

Recently uploaded

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

NLTK

  • 2. Contents • Tokenization • Corpuses • Frequency Distribution • Stylistics • SentenceTokenization • WordNet • Stemming • Lemmatization • Part of SpeechTagging • Tagging Methods • UnigramTagging • N-gramTagging • Chunking – Shallow Parsing • Entity Recognition • SupervisedClassification • DocumentClassification • Hidden Markov Models - HMM • References
  • 3. NLTK • A set of Python modules to carry out many common natural language tasks. • Basic classes to represent data for NLP • Infrastructure to build NLP programs in Python • Python interface to over 50 corpora and lexical resources • Focus on Machine Learning with specific domain knowledge • Free and Open Source
  • 4. NLTK • Numpy and Scipy under the hood • Fast and Formal • Standard interfaces for tokenization, part-of-speech tagging, syntactic parsing and text classification • Windows: >>> import nltk >>> nltk.download('all') • Linux $ pip install --upgrade nltk
  • 5. NLTK -Top-Level Organization • Organized as a flat hierarchy of packages and modules • Each module provides the tools necessary to address a specific task • Modules has two types of classes – Data-oriented classes • Used to represent information relevant to natural language processing. – Task-oriented classes • Encapsulate the resources and methods needed to perform a specific task.
  • 6. Modules • Token - classes for representing and processing individual elements of text, such as words and sentences • Probability - classes for representing and processing probabilistic information • Tree - classes for representing and processing hierarchical information over text • Cfg - classes for representing and processing context free grammars
  • 7. Modules • Tagger - tagging each word with a part-of-speech, a sense, etc • Parser - building trees over text (includes chart, chunk and probabilistic parsers) • Classifier - classify text into categories (includes feature, featureSelection, maxent, naivebayes) • Draw - visualize NLP structures and processes • Corpus - access (tagged) corpus data
  • 8. Tokenization • Simplest way to represent a text is with a single string • Difficult to process text in this format • Convenient to work with a list of tokens • Task of converting a text from a single string to a list of tokens is known as tokenization • The most basic natural language processing technique • Example -WordTokenization Input : “Hey there, How are you all?” Output : “Hey”, “there,”, “How”, “are”, “you”, “all?”
  • 9. Tokens andTypes • The term word can be used in two different ways – To refer to an individual occurrence of a word – To refer to an abstract vocabulary item • For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items
  • 10. Tokens andTypes • To avoid confusion use more precise terminology – Word token - an occurrence of a word – WordType - a vocabulary item • Tokens constructed from their types using theToken constructor • Token member functions - type and loc
  • 11. Tokens andTypes >>> from nltk.token import * >>> my_word_type = 'dog‘ 'dog’ >>> my_word_token =Token(my_word_type) ‘dog'@[?]
  • 12. Text Locations • Text location @ [s:e] specifies a region of a text – s is the start index – e is the end index • Specifies the text beginning at s, and including everything up to (but not including) the text at e • Consistent with Python slice
  • 13. Text Locations • Think of indices as appearing between elements – I saw a man – 0 1 2 3 4 • Shorthand notation when location width = 1 • Indices based on different units – character – word – sentence
  • 14. Text Locations • Locations tagged with sources – files, other text locations – the first word of the first sentence in the file • Location member functions – start – end – unit – source
  • 15. Text Corpus • Large collection of text • Concentrate on a topic or open domain • May be raw text or annotated / categorized
  • 16. Corpuses • Gutenberg - selection of e-books from Project Gutenberg • Webtext - forum discussions, reviews, movie script • nps_chat - anonymized chats • Brown - 1 million word corpus, categorized by genre • Reuters - news corpus • Inaugural - inaugural addresses of presidents • Udhr - multilingual corpus
  • 17. Accessing Corpora • Corpora on disk - text files • NLTK provides Python modules / functions / classes that allow for accessing the corpora in a convenient way • It is quite an effort to write functions that read in a corpus especially when it comes with annotations • The task of reading in a corpus is needed in many NLP projects
  • 18. Accessing Corpora • # tell Python we want to use the Gutenberg corpus • from nltk.corpus import gutenberg • # which files are in this corpus? • print(gutenberg.fileids()) • >>> ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible- kjv.txt', ...]
  • 19. Accessing Corpora - RawText • # get the raw text of a corpus = one string • >>> emmaText = gutenberg.raw("austen-emma.txt") • # print the first 289 characters of the text • >>> emmaText = gutenberg.raw("austen-emma.txt") • >>> emmaText[:289] • '[Emma by Jane Austen 1816]nnVOLUME InnCHAPTER InnnEmmaWoodhouse, handsome, clever, and rich, with a comfortable homenand happy disposition, seemed to unite some of the best blessingsnof existence; and had lived nearly twenty-one years in the worldnwith very little to distress or vex her.‘
20. Accessing Corpora - Words
# get the words of a corpus as a list
>>> emmaWords = gutenberg.words("austen-emma.txt")
# print the first 30 words of the text
>>> print(emmaWords[:30])
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed']
21. Accessing Corpora - Sentences
# get the sentences of a corpus as a list of lists - one list of words per sentence
>>> senseSents = gutenberg.sents("austen-sense.txt")
# print out the first four sentences
>>> print(senseSents[:4])
[['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']'], ['CHAPTER', '1'], ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.'], ['Their', 'estate', 'was', 'large', ',', 'and', 'their', 'residence', 'was', 'at', ...]]
22. Counting
• Use the Inaugural Address text
>>> from nltk.book import text4
• Counting vocabulary - the length of a text from start to finish
>>> len(text4)
145735
• How many distinct words?
>>> len(set(text4)) # types
9754
• Lexical richness - on average, each word type is used about 15 times
>>> len(text4) / len(set(text4))
14.941049825712529
• Percentage of the text taken up by a specific word
>>> 100 * text4.count('democracy') / len(text4)
0.03568120218204275
23. Positions of a Word in Text - Lexical Dispersion Plot
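The slide above is an image of the plot itself; a minimal sketch that reproduces such a dispersion plot with nltk.Text.dispersion_plot (the word list is illustrative, borrowed from the NLTK book's inaugural example, and matplotlib must be installed):
>>> from nltk.book import text4
>>> # each stripe marks one occurrence of a word at that offset in the text
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])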
24. List Elements Operations
• List comprehension
>>> len(set([word.lower() for word in text4 if len(word) > 5]))
7339
>>> [w.upper() for w in text4[0:5]]
['FELLOW', '-', 'CITIZENS', 'OF', 'THE']
• Loops and conditionals
>>> for word in text4[0:5]:
...     if len(word) < 5 and word.endswith('e'):
...         print word, 'is short and ends with e'
...     elif word.istitle():
...         print word, 'is a titlecase word'
...     else:
...         print word, 'is just another word'
25. Brown Corpus
• First million-word electronic corpus of English
• Created at Brown University in 1961
• Text from 500 sources, categorized by genre
>>> from nltk.corpus import brown
>>> print(brown.categories())
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
26. Brown Corpus - Retrieve Words by Category
>>> from nltk.corpus import brown
>>> news_words = brown.words(categories="news")
>>> print(news_words)
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', ...]
27. Brown Corpus - Retrieve Words by Category
>>> adv_words = brown.words(categories="adventure")
>>> print(adv_words)
['Dan', 'Morgan', 'told', 'himself', 'he', 'would', 'forget', 'Ann', 'Turner', '.', ...]
>>> reli_words = brown.words(categories="religion")
>>> print(reli_words)
['As', 'a', 'result', ',', 'although', 'we', 'still', 'make', 'use', 'of', 'this', 'distinction', ',', ...]
28. Frequency Distribution
• Records how often each item occurs in a list of words
• A frequency distribution over words is basically a dictionary with some extra functionality
• The constructor creates a frequency distribution from a list of words
29. Frequency Distribution
>>> news_words = brown.words(categories="news")
>>> fdist = nltk.FreqDist(news_words)
>>> print("shoe:", fdist["shoe"])
>>> print("the:", fdist["the"])
30. Frequency Distribution
# show the 10 most frequent words & their frequencies
>>> fdist.tabulate(10)
the    ,    .    of   and   to    a    in   for  The
5580  5188  4030  2849  2146  2116  1993  1893  943  806
31. Plot Frequency Distribution
• Create a plot of the 10 most frequent words
>>> fdist.plot(10)
32. Stylistics
• The study of systematic differences between genres
• The Brown corpus with its categories is a convenient resource
• Is there a difference in how the modal verbs (can, could, may, might, must, will) are used across the genres?
• Let us look at the frequency distribution
33. Stylistics
>>> from nltk import FreqDist
# define modals of interest
>>> modals = ["may", "could", "will"]
# define genres of interest
>>> genres = ["adventure", "news", "government", "romance"]
# count how often the modals occur in each genre
>>> for g in genres:
...     words = brown.words(categories=g)
...     fdist = FreqDist([w.lower() for w in words if w.lower() in modals])
...     print g, fdist
34. Conditional Frequency Distributions
>>> from nltk import ConditionalFreqDist
>>> cfdist = ConditionalFreqDist()
>>> for g in genres:
...     words = brown.words(categories=g)
...     for w in words:
...         if w.lower() in modals:
...             cfdist[g].inc(w.lower())
>>> cfdist.tabulate()
            could   may   will
adventure     154     7     51
government     38   179    244
news           87    93    389
romance       195    11     49
>>> cfdist.plot(title="Modals in various Genres")
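Note that FreqDist.inc() belongs to the old NLTK 2 API. In NLTK 3 a frequency distribution subclasses collections.Counter, so the same counts are built with item assignment; a minimal sketch of the modern equivalent:
import nltk
from nltk.corpus import brown

modals = ["may", "could", "will"]
genres = ["adventure", "news", "government", "romance"]

# in NLTK 3, ConditionalFreqDist behaves like a dict of Counters,
# so counts are incremented with item assignment instead of .inc()
cfdist = nltk.ConditionalFreqDist()
for g in genres:
    for w in brown.words(categories=g):
        if w.lower() in modals:
            cfdist[g][w.lower()] += 1

cfdist.tabulate()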
36. Processing Raw Text
• Assume you have a text file on your disk...
# read the text
>>> path = "holmes.txt"
>>> f = open(path)
>>> rawText = f.read()
>>> f.close()
>>> print(rawText[:165])
THE ADVENTURES OF SHERLOCK HOLMES
By
SIR ARTHUR CONAN DOYLE
I. A Scandal in Bohemia
II. The Red-headed League
37. Sentence Tokenization
# split the text up into sentences
>>> sents = nltk.sent_tokenize(rawText)
>>> print(sents[20:22])
['I had seen little of Holmes lately.', 'My marriage had drifted us\r\naway from each other.', ...]
38. Word Tokenization
# tokenize the sentences using nltk
>>> tokens = []
>>> for sent in sents:
...     tokens += nltk.word_tokenize(sent)
>>> print(tokens[300:350])
['such', 'as', 'his', '.', 'And', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', ',', 'and', 'that', 'woman', 'was', 'the', 'late', 'Irene', 'Adler', ',', 'of', 'dubious', 'and', 'questionable', 'memory', ...]
39. Creating a Text Object
• From a list of tokens, we can create an nltk.Text object for a document
• Collocations - terms that occur together unusually often
• Concordance view - shows the contexts in which a token occurs
40. Creating a Text Object
# create a text object
>>> text = nltk.Text(tokens)
# inspect collocations (collocations() prints its results itself)
>>> text.collocations()
Sherlock Holmes; said Holmes; St. Simon; Baker Street; Lord St.; St. Clair; Mr. Holmes; Hosmer Angel; Irene Adler; Miss Hunter; young lady; Briony Lodge; Stoke Moran; Neville St.; Miss Stoner; Scotland Yard; could see; Mr. Holmes.; Boscombe Pool; Mr. Rucastle
41. Concordance View
>>> text.concordance("Irene")
Building index...
Displaying 17 of 17 matches:
to love for Irene Adler . All emotions , and that one
was the late Irene Adler , of dubious and questionable
dventuress , Irene Adler . The name is no doubt familia
nd . " " And Irene Adler ? " " Threatens to send them t
se , of Miss Irene Adler . " " Quite so ; but the seque
And what of Irene Adler ? " I asked . " Oh , she has t
tying up of Irene Adler , spinster , to Godfrey Norton
ction . Miss Irene , or Madame , rather , returns from
...
42. Annotated Corpora
• Example - The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn ...
• Some corpora come with annotations - POS tags, parse trees, ...
• NLTK provides convenient access to these corpora (the text plus its annotations)
• Treebanks (e.g. the Penn Treebank) - collections of manually parsed (constituency or dependency) sentences, usable for training a statistical parser or for parser evaluation
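For example, the 10% Penn Treebank sample bundled with NLTK exposes both the POS annotations and the parse trees:
>>> from nltk.corpus import treebank
>>> treebank.tagged_words()[:3]
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]
>>> print(treebank.parsed_sents()[0]) # the same sentence as a parse tree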
43. WordNet
• Structured, semantically oriented English dictionary
• Synonyms, antonyms, hyponyms, hypernyms, depth of a synset, trees, entailments, etc.
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> wn.synset('car.n.01').lemma_names
['car', 'auto', 'automobile', 'machine', 'motorcar']
44. WordNet
>>> wn.synset('car.n.01').definition
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> for synset in wn.synsets('car')[1:3]:
...     print synset.lemma_names
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
>>> wn.synset('walk.v.01').entailments() # walking involves stepping
[Synset('step.v.01')]
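The slide mentions synset trees and depth without showing code; a small sketch of navigating the hypernym hierarchy and comparing two synsets (note that in NLTK 3 the definition and lemma_names attributes above become method calls - definition(), lemma_names()):
>>> wn.synset('car.n.01').hypernyms() # one step up the is-a tree
[Synset('motor_vehicle.n.01')]
>>> wn.synset('car.n.01').min_depth() # distance from the root of the hierarchy
>>> wn.synset('dog.n.01').path_similarity(wn.synset('cat.n.01')) # shortest-path similarity
0.2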
45. Getting Input Text - HTML
>>> from urllib import urlopen
>>> url = "http://www.bbc.co.uk/news/science-environment-21471908"
>>> html = urlopen(url).read()
>>> html[:60]
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http'
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens[:15]
['BBC', 'News', '-', 'Exoplanet', 'Kepler', '37b', 'is', 'tiniest', 'yet', '-', 'smaller', 'than', 'Mercury', 'Accessibility', 'links']
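nltk.clean_html() was removed in NLTK 3, which points users to BeautifulSoup instead; a rough Python 3 equivalent of this slide, assuming the bs4 package is installed:
import nltk
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.bbc.co.uk/news/science-environment-21471908"
html = urlopen(url).read()

# strip the markup, keeping only the visible text
raw = BeautifulSoup(html, "html.parser").get_text()
tokens = nltk.word_tokenize(raw)
print(tokens[:15])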
46. Getting Input Text - User
>>> s = raw_input("Enter some text: ")
• Or use your own files on disk
>>> f = open('C:/Data/Files/UK_natl_2010_en_Lab.txt')
>>> raw = f.read()
>>> print raw[:100]
#Foreword by Gordon Brown
This General Election is fought as our troops are bravely fighting to def
47. Import Files as Corpus
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = "C:/Data/Files/"
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*\.txt')
>>> wordlists.fileids()[:3]
['UK_natl_1987_en_Con.txt', 'UK_natl_1987_en_Lab.txt', 'UK_natl_1987_en_LibSDP.txt']
>>> wordlists.words('UK_natl_2010_en_Lab.txt')
['#', 'Foreword', 'by', 'Gordon', 'Brown', '.', 'This', ...]
48. Stemming
• Strips off affixes
>>> porter = nltk.PorterStemmer()
>>> [porter.stem(t) for t in tokens]
• Porter stemmer: lying -> lie, women -> women
>>> lancaster = nltk.LancasterStemmer()
>>> [lancaster.stem(t) for t in tokens]
• Lancaster stemmer: lying -> lying, women -> wom
49. Lemmatization
• Removes affixes only if the resulting word is in its dictionary
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
• lying -> lying, women -> woman
50. Write Output to File
• Save the sorted vocabulary of a text to a new file
>>> output_file = open('C:/Data/Files/output.txt', 'w')
>>> words = set(tokens)
>>> for word in sorted(words):
...     output_file.write(word + "\n")
• To write non-text data, first convert it to a string with str()
• Avoid filenames that contain space characters or that are identical except for case distinctions
51. Part of Speech Tagging
• POS tagging - the process of classifying words into their parts of speech and labelling them accordingly
– words are grouped into classes such as nouns, verbs, adjectives and adverbs
• Parts of speech are also known as word classes or lexical categories
• The collection of tags used for a particular task is known as a tagset
52. Part of Speech Tagging
• NLTK tags text automatically; uses include
– predicting the behaviour of previously unseen words
– analyzing word usage in corpora
– text-to-speech systems
– powerful searches
– classification
53. Tagging Methods
• Default tagger
• Regular expression tagger
• Unigram tagger
• N-gram taggers
54. Tagging Methods
• Taggers can be combined using a technique known as backoff
– when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger)
• Taggers can be trained and evaluated using tagged corpora (a sketch follows after slide 58)
55. Tagging Examples
• Some corpora come already tagged
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
• A simple example
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
– CC is coordinating conjunction; RB is adverb; IN is preposition; NN is noun; JJ is adjective
– Lots of others - foreign term, verb tenses, "wh" determiner etc
56. Tagging Examples
• An example with homonyms
>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
57. Unigram Tagging
• Unigram tagging - nltk.UnigramTagger()
– assigns to each token the tag that is most likely for that particular token
– train it by passing tagged sentence data as a parameter when initializing the tagger
– keep training and testing data separate
58. N-gram Tagging
• The context is the current word together with the part-of-speech tags of the n-1 preceding tokens
• Evaluate performance on held-out data
• Contexts that were not present in the training data force a trade-off between accuracy and coverage
• Combine taggers via backoff (see the sketch below)
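A minimal sketch of training, combining and evaluating taggers on the Brown news category, along the lines of the NLTK book (the 90/10 split is an illustrative choice):
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# back off from bigram context to unigram to a default 'NN' guess
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

print(t2.evaluate(test_sents))  # accuracy on the held-out sentences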
59. Information Extraction
• Search large bodies of unrestricted text for specific types of entities and relations
• Move these into well-organized databases
• Use these databases to find answers to specific questions
60. Information Extraction - Steps
• Segment, tokenize and part-of-speech tag the text
• Search the resulting data for specific types of entity
• Examine entities that are mentioned near one another in the text to determine whether specific relationships hold between them
61. Chunking - Shallow Parsing
• Analyzes a sentence to identify its constituents - noun groups, verbs, verb groups etc
• However, it does not specify their internal structure, nor their role in the main sentence
• In the usual chunking diagram, smaller boxes show word-level tokenization and part-of-speech tagging, while large boxes show higher-level chunking; each of these larger boxes is called a chunk
• Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens
• Like tokenization, the pieces produced by a chunker do not overlap in the source text
• A simple chunker can be built from a single tag pattern (see the sketch below)
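A minimal noun-phrase chunker built with nltk.RegexpParser; the pattern is the classic determiner-adjectives-noun rule from the NLTK book:
import nltk

# NP chunk = optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(cp.parse(sentence))  # a Tree whose NP subtrees are the chunks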
63. Entity Recognition
• Entity recognition is performed using chunkers
– segment multi-token sequences and label them with the appropriate entity type
– ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY and GPE (geo-political entity)
• Constructing chunkers
– use rule-based systems like the RegexpParser class from NLTK
– use machine learning techniques like the ConsecutiveNPChunker from the NLTK book
– POS tags are very important in this context
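NLTK also ships a pre-trained named-entity chunker, nltk.ne_chunk, which runs on POS-tagged input; a minimal sketch with an illustrative sentence:
import nltk

tokens = nltk.word_tokenize("Mark works at Reuters in London")
tagged = nltk.pos_tag(tokens)

# ne_chunk wraps recognized spans in subtrees labelled PERSON, ORGANIZATION, GPE, ...
print(nltk.ne_chunk(tagged))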
64. Relation Extraction
• Rule-based systems - look for specific patterns in the text that connect entities and the intervening words
• Machine-learning systems - attempt to learn such patterns automatically from a training corpus
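A sketch of the rule-based approach over NLTK's IEER news corpus, following the NLTK book example; the regular expression keys on the word 'in' appearing between an organization and a location:
import re
import nltk

# match 'in' between the entities, but not gerunds such as 'including'
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))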
65. Processing Text
• Choose a particular class label for a given input
• Identify particular features of language data that are salient for classifying it
• Construct models of language that can be used to perform language processing tasks automatically
• Learn about text / language from these models
• Machine learning techniques
– decision trees
– naive Bayes classifiers
– maximum entropy classifiers
66. Applications
• Determining the topic of an article or a book
• Deciding if an email is spam or not
• Determining who wrote a text
• Determining the meaning of a word in a particular context
• Open-class classification - the set of labels is not defined in advance
• Multi-label classification - each instance may be assigned multiple labels
• Sequence classification - a list of inputs is jointly classified
68. Example - Identify Gender by Name
• Relevant feature - the last letter
• Create a feature set (a dictionary) that maps feature names to their values
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
• Import the names corpus and shuffle it
>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
69. Example - Identify Gender by Name
• Divide the list of features into a training set and a test set
>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> from nltk.classify import apply_features
# use apply_features when working with large corpora - it avoids building the whole list in memory
>>> train_set = apply_features(gender_features, names[500:])
>>> test_set = apply_features(gender_features, names[:500])
• Use the training set to train a naive Bayes classifier
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
70. Example - Identify Gender by Name
• Test the classifier on unseen data
>>> classifier.classify(gender_features('Neo'))
'male'
>>> classifier.classify(gender_features('Trinity'))
'female'
>>> print nltk.classify.accuracy(classifier, test_set)
0.744
71. Example - Identify Gender by Name
• Examine the classifier to see which features are most effective at distinguishing between the classes
>>> classifier.show_most_informative_features(5)
Most Informative Features
last_letter = 'a'    female : male = 35.7 : 1.0
last_letter = 'k'    male : female = 31.7 : 1.0
last_letter = 'f'    male : female = 16.6 : 1.0
last_letter = 'p'    male : female = 11.9 : 1.0
last_letter = 'v'    male : female = 10.5 : 1.0
72. Example - Document Classification
• Use corpora where documents have been labelled with categories
– build classifiers that will automatically tag new documents with the appropriate category labels
• Use the movie review corpus, which categorizes reviews as positive or negative, to construct a list of documents
• Define a feature extractor for documents - one feature for each of the 2000 most frequent words in the corpus
• The feature extractor simply checks whether each of these words is present in a document
• Train a classifier to label new movie reviews (a sketch follows)
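A minimal sketch of the document classifier described above, following the NLTK book's movie-review example:
import random
import nltk
from nltk.corpus import movie_reviews

# each document is a (list of words, category) pair
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# one boolean feature per frequent word - is that word in the document?
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

featuresets = [(document_features(d), c) for d, c in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)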
73. Document Classification
• Compute accuracy on the test set
>>> print nltk.classify.accuracy(classifier, test_set)
0.79
• Evaluation issues - a suitable size for the test set depends on the number of labels, their balance and the diversity of the texts
• Show the most informative features
>>> classifier.show_most_informative_features(5)
Most Informative Features
contains(outstanding) = True    pos : neg = 11.2 : 1.0
contains(mulan) = True          pos : neg = 8.9 : 1.0
contains(wonderfully) = True    pos : neg = 8.5 : 1.0
contains(seagal) = True         neg : pos = 8.3 : 1.0
contains(damon) = True          pos : neg = 6.0 : 1.0
74. Context
• Contextual features often provide powerful clues for classification
• A context-dependent feature extractor is passed a complete (untagged) sentence along with the index of the target word
• Joint classifier models - choose an appropriate labelling for a whole collection of related inputs
75. Sequence Classification
• Jointly choose part-of-speech tags for all the words in a given sentence
• Consecutive classification - find the most likely class label for the first input, then use that answer to help find the best label for the next input, and repeat
• The feature extraction function needs to take a history argument - the list of tags predicted so far (see the sketch below)
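A minimal sketch of such a history-aware feature extractor, adapted from the NLTK book's consecutive POS tagger (the suffix features are illustrative):
def pos_features(sentence, i, history):
    # history holds the tags already predicted for words 0 .. i-1
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features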
76. Hidden Markov Models - HMM
• Use the inputs and the history of predicted tags
• Generate a probability distribution over tags, instead of just picking the single best tag for each word
• Combine the probabilities to calculate scores for whole sequences
• Choose the tag sequence with the highest probability
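NLTK includes a supervised HMM tagger; a minimal sketch trained on the bundled Treebank sample (the 3000-sentence split is an illustrative choice):
from nltk.corpus import treebank
from nltk.tag import hmm

train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]

# estimate transition and emission probabilities from tagged sentences
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)

print(tagger.tag("Pierre Vinken will join the board".split()))
print(tagger.evaluate(test_sents))  # beware - unseen words hurt an unsmoothed HMM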
77. More Advanced Models
• Maximum Entropy Markov Models
• Linear-Chain Conditional Random Field Models
78. References
1. Indurkhya, Nitin and Fred Damerau (eds.) (2010) Handbook of Natural Language Processing (Second Edition). Chapman & Hall/CRC.
2. Jurafsky, Daniel and James Martin (2008) Speech and Language Processing (Second Edition). Prentice Hall.
3. Mitkov, Ruslan (ed.) (2003) The Oxford Handbook of Computational Linguistics. Oxford University Press.
4. Bird, Steven; Klein, Ewan; Loper, Edward (2009) Natural Language Processing with Python. O'Reilly Media Inc.
5. Perkins, Jacob (2010) Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing.
6. Bird, Steven; Klein, Ewan; Loper, Edward; Baldridge, Jason (2008) Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics. ACL.
79. Thank You
Check out my LinkedIn profile at https://in.linkedin.com/in/girishkhanzode