Microsoft New England Research and Development Center, June 22, 2011
Natural Language Processing and Machine Learning Using Python
Shankar Ambady
Example Files
Hosted on GitHub: https://github.com/shanbady/NLTK-Boston-Python-Meetup
What is “Natural Language Processing”?
Where is this stuff used?
The machine learning paradox
A look at a few key terms
Quick start: creating NLP apps in Python
What is Natural Language Processing? Computer-aided text analysis of human language.
 The goal is to enable machines to understand human language and extract meaning from text.
It is a field of study that falls under the category of machine learning and, more specifically, computational linguistics.
The “Natural Language Toolkit” is a Python module that provides a variety of functionality to aid us in processing text. Natural language processing is heavily used throughout all web technologies.
Paradoxes in Machine Learning
Context
Little sister: What’s your name?
Me: Uhh… Shankar..?
Sister: Can you spell it?
Me: Yes. S-H-A-N-K-A…
Sister: WRONG! It’s spelled “I-T”
Language translation is a complicated matter! Go to: http://babel.mrfeinberg.com/
The problem with communication is the illusion that it has occurred
The problem with communication is the illusion that it has occurred
Das Problem mit Kommunikation ist die Illusion, dass es aufgetreten ist
The problem with communication is the illusion that it arose
Das Problem mit Kommunikation ist die Illusion, dass es entstand
The problem with communication is the illusion that it developed
Das Problem mit Kommunikation ist die Illusion, die sie entwickelte
The problem with communication is the illusion, which developed it
The problem with communication is the illusion that it has occurred
The problem with communication is the illusion, which developed it
EPIC FAIL
Police police police.
The above statement is a complete sentence stating that “police officers police other police officers.”
She yelled “police police police”: “someone called for police.”
Key Terms
The NLP Pipeline
Setting up NLTK
Source downloads are available for Mac and Linux, as well as installable packages for Windows: http://www.nltk.org/download
Currently only available for Python 2.5–2.6.
`easy_install nltk`
Prerequisites: NumPy, SciPy
First steps
NLTK comes with packages of corpora that are required for many modules. Open a Python interpreter:

import nltk
nltk.download()

If you do not want to use the downloader with a GUI (which requires the Tkinter module), run:

python -m nltk.downloader <name of package or "all">
You may individually select packages or download them in bulk.
Let’s dive into some code!
Part of Speech Tagging

from nltk import pos_tag, word_tokenize
sentence1 = 'this is a demo that will show you how to detects parts of speech with little effort using NLTK!'
tokenized_sent = word_tokenize(sentence1)
print pos_tag(tokenized_sent)

[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'), ('that', 'WDT'), ('will', 'MD'), ('show', 'VB'), ('you', 'PRP'), ('how', 'WRB'), ('to', 'TO'), ('detects', 'NNS'), ('parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('with', 'IN'), ('little', 'JJ'), ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'), ('!', '.')]
Penn Treebank Part-of-Speech Tags
Source: http://www.ai.mit.edu/courses/6.863/tagdef.html
NLTK Text

nltk.clean_html(rawhtml)

from nltk.corpus import brown
from nltk import Text
brown_words = brown.words(categories='humor')
brownText = Text(brown_words)
brownText.collocations()
brownText.count("car")
brownText.concordance("oil")
brownText.dispersion_plot(['car', 'document', 'funny', 'oil'])
brownText.similar('humor')
Find similar terms (word definitions) using WordNet

import nltk
from nltk.corpus import wordnet as wn
synsets = wn.synsets('phone')
print [str(syns.definition) for syns in synsets]

'electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds'
'(phonetics) an individual sound unit of speech without concern as to whether or not it is a phoneme of some language'
'electro-acoustic transducer for converting electric signals into sounds; it is held over or inserted into the ear'
'get or try to get into communication (with someone) by telephone'
from nltk.corpus import wordnet as wn
synsets = wn.synsets('phone')
print [str(syns.definition) for syns in synsets]

“syns.definition” can be modified to output hypernyms, meronyms, holonyms, etc.
Fun things to Try
Feeling lonely? Eliza is there to talk to you all day! What human could ever do that for you?

from nltk.chat import eliza
eliza.eliza_chat()

…starts the chatbot:

Therapist
---------
Talk to the program by typing in plain English, using normal upper- and lower-case letters and punctuation. Enter "quit" when done.
========================================================================
Hello. How are you feeling today?
English to German to English to German…

from nltk.book import *
babelize_shell()

Babel> the internet is a series of tubes
Babel> german
Babel> run
0> the internet is a series of tubes
1> das Internet ist eine Reihe SchlSuche
2> the Internet is a number of hoses
3> das Internet ist einige SchlSuche
4> the Internet is some hoses
Babel>
Let’s build something even cooler
Let's write a spam filter! A program that analyzes legitimate email (“ham”) as well as spam, and learns the features that are associated with each. Once trained, we should be able to run this program on incoming mail and have it reliably label each one with the appropriate category.
What you will need
NLTK (of course), as well as the “stopwords” corpus
A good dataset of emails, both spam and ham
Patience and a cup of coffee (these programs tend to take a while to complete)
Finding Great Data: The Enron Emails
A dataset of 200,000+ emails made publicly available in 2003 after the Enron scandal. It contains both spam and actual corporate ham mail. For this reason it is one of the most popular datasets used for testing and developing anti-spam software.
The dataset we will use is located at the following URL:
http://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/preprocessed/
It contains a list of archived files that contain plaintext emails in two folders, Spam and Ham.
Extract one of the archives from the site into your working directory. Create a Python script; let's call it “spambot.py”. Your working directory should contain the spambot script and the folders “spam” and “ham”.

“Spambot.py”

from nltk import word_tokenize, \
    WordNetLemmatizer, NaiveBayesClassifier, \
    classify, MaxentClassifier
from nltk.corpus import stopwords
import random
import os, glob, re
“Spambot.py” (continued)

wordlemmatizer = WordNetLemmatizer()
commonwords = stopwords.words('english')  # load common English words into a list

hamtexts = []
spamtexts = []
# start globbing the files into the appropriate lists
for infile in glob.glob(os.path.join('ham/', '*.txt')):
    text_file = open(infile, "r")
    hamtexts.append(text_file.read())
    text_file.close()
for infile in glob.glob(os.path.join('spam/', '*.txt')):
    text_file = open(infile, "r")
    spamtexts.append(text_file.read())
    text_file.close()
“Spambot.py” (continued)

# label each item with the appropriate label and store them as a list of tuples
mixedemails = [(email, 'spam') for email in spamtexts]
mixedemails += [(email, 'ham') for email in hamtexts]
# let's give them a nice shuffle
random.shuffle(mixedemails)

From this list of shuffled but labeled emails, we will define a “feature extractor” which outputs a feature set that our program can use to statistically compare spam and ham.
“Spambot.py” (continued)

def email_features(sent):
    features = {}
    # normalize words
    wordtokens = [wordlemmatizer.lemmatize(word.lower()) for word in word_tokenize(sent)]
    for word in wordtokens:
        # if the word is not a stop-word, consider it a “feature”
        if word not in commonwords:
            features[word] = True
    return features

# run each email through the feature extractor and collect the results in a “featureset” list
featuresets = [(email_features(n), g) for (n, g) in mixedemails]
The features you select must be binary features (True or False), such as the presence or absence of a word or part-of-speech tag.
To use features that are non-binary, such as numeric values, you must convert them to binary features. This process is called “binning”.
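Binning can be sketched as a small helper; this `bin_numeric` function is a hypothetical illustration (not part of the deck's spambot), but it produces features of the same shape as the example that follows:

```python
def bin_numeric(name, value, edges):
    """Turn a numeric value into a single binary feature whose name
    records which interval of `edges` the value fell into."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return ('%s(%d<=x<%d)' % (name, lo, hi), True)
    return ('%s(out_of_range)' % name, True)

# e.g. binning a message's word count into coarse ranges
print(bin_numeric('word_count', 12, [0, 10, 20, 50]))
# ('word_count(10<=x<20)', True)
```

The classifier then sees only the True/False presence of each named bin, never the raw number.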
For example, if the feature is the number 12, the binned feature is: (“11<x<13”, True).

“Spambot.py” (continued)

Let's grab a sampling of our featureset. Changing this number will affect the accuracy of the classifier. It will be a different number for every classifier; find the most effective threshold for your dataset through experimentation.

size = int(len(featuresets) * 0.5)
# using this threshold, grab the first n elements of our featureset and the
# last n elements to populate our “training” and “test” featuresets
train_set, test_set = featuresets[size:], featuresets[:size]
# start training the classifier using NLTK's built-in Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)
“Spambot.py” (continued)

print classifier.labels()

This will output the labels that our classifier will use to tag new data:
['ham', 'spam']

The purpose of creating a “training set” and a “test set” is to test the accuracy of our classifier on a separate sample from the same data source.

print classify.accuracy(classifier, test_set)
0.75623
classifier.show_most_informative_features(20)

This prints the 20 features that most strongly distinguish Spam from Ham.
“Spambot.py” (continued)

while True:
    featset = email_features(raw_input("Enter text to classify: "))
    print classifier.classify(featset)

We can now directly input new email and have it classified as either Spam or Ham.
A few notes:
The quality of your input data will affect the accuracy of your classifier.
The threshold value that determines the sample size of the feature set will need to be refined until it reaches its maximum accuracy. It will need to be adjusted if training data is added, changed, or removed.
The accuracy of this classifier can be misleading; in fact, our spambot has an accuracy of 98%, but this only applies to Enron emails. This is known as “over-fitting”.
Try classifying your own emails using this trained classifier and you will notice a sharp decline in accuracy.

Real-time Reviews
We will use the same classification script, but modify it to detect positive and negative movie tweets in real time! You will need the Rotten Tomatoes polarity dataset:
http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
Two Labels: Fresh and Rotten
(Flixster, Rotten Tomatoes, and the Certified Fresh logo are trademarks or registered trademarks of Flixster, Inc. in the United States and other countries.)
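A hedged sketch of what the retrained pipeline might look like. `review_features` mirrors `email_features` from the spambot; the file names `fresh.txt` and `rotten.txt` are placeholders for whatever the downloaded archive actually unpacks to (one review snippet per line), and plain `split()` stands in for `nltk.word_tokenize` to keep the sketch dependency-free:

```python
import os
import random
from nltk import NaiveBayesClassifier, classify

def review_features(sent):
    # bag-of-words binary features, same shape as email_features
    return dict((word.lower(), True) for word in sent.split())

def train_review_classifier(fresh_path, rotten_path):
    """Train on two plaintext files of review snippets, one per line."""
    labeled = []
    for path, label in [(fresh_path, 'fresh'), (rotten_path, 'rotten')]:
        with open(path) as f:
            labeled += [(review_features(line), label) for line in f if line.strip()]
    random.shuffle(labeled)
    size = len(labeled) // 2  # same 50% threshold as the spambot
    clf = NaiveBayesClassifier.train(labeled[size:])
    print('accuracy:', classify.accuracy(clf, labeled[:size]))
    return clf

# guarded so the sketch only runs once the placeholder files exist
if os.path.exists('fresh.txt') and os.path.exists('rotten.txt'):
    classifier = train_review_classifier('fresh.txt', 'rotten.txt')
```

The resulting `classifier` and `review_features` are exactly what the streaming snippet that follows expects to have in scope.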
easy_install tweetstream

import tweetstream

words = ["green lantern"]
with tweetstream.TrackStream("yourusername", "yourpassword", words) as stream:
    for tweet in stream:
        tweettext = tweet.get('text', '')
        tweetuser = tweet['user']['screen_name'].encode('utf-8')
        featset = review_features(tweettext)
        print tweetuser, ': ', tweettext
        print tweetuser, " thinks Green Lantern is ", classifier.classify(featset), \
            "\n\n\n----------------------------"

Further Resources:
I will be uploading this presentation to my site: http://www.shankarambady.com
“Natural Language Processing with Python” by Steven Bird, Ewan Klein, and Edward Loper: http://www.nltk.org/book
API reference: http://nltk.googlecode.com/svn/trunk/doc/api/index.html
Great NLTK blog: http://streamhacker.com/
Thank you for watching!
Special thanks to John Verostek and Microsoft!

Editor's Notes

  • #24 You may also run `print [str(syns.examples) for syns in synsets]` for usage examples of each definition.