Document Classification using the Natural Language Toolkit
   Ben Healey
   @BenHealey
Source: IStockPhoto
The Need for Automation
Take ur pick!
Features:
   - # Words
   - % ALLCAPS
   - Unigrams
   - Sender
   - And so on.
   Class:
   The Development...
Relevant NLTK Modules
   Feature Extraction
   from nltk.corpus import words, stopwords
   from nltk.stem import Por...
NaiveBayesClassifier
   P(label|features) = P(label) ∗ P(features|label) / P(features)
   P(label|features) = P(label...
517,431 Emails
   Source: IStockPhoto
Prep: Extract and Load
   Sample* of 20,581 plaintext files
   import MySQLdb, os, random, string
   MySQL via Pyt...
Prep: Extract and Load
   Allocation of random number
   Some feature extraction
   #To, #CCd, #Words, %digits, %CAP...
From:
   To:
   Subject: Re: Agenda for FERC Meeting RE: EOL
   Lou...
From:
   To:
   Subject: Start Date: 1/11/02; HourAhead hour: 5;
   Start ...
Class[es] assigned for 1,000 randomly selected messages:
Prep: Show us ur Features
   NLTK toolset
   from nltk.corpus import words, stopwords
   from nltk.stem import Porte...
Prep: Show us ur Features
   Features in boolean or nominal form
   if record['num_words_in_body']<=20:
   features[...
Prep: Show us ur Features
   Features in boolean or nominal form
   text=record['msg_subject']+" "+record['msg_body']...
Sit. Say. Heel.
   random.shuffle(dev_set)
   cutoff = len(dev_set)*2/3
   train_set=dev_set[:cutoff]
   test_set=...
Most Important Features
Most Important Features
Most Important Features
Performance: ‘IT’ Model
   IMPORTANT: These are ‘cheat’ scores!
Performance: ‘Deal’ Model
   IMPORTANT: These are ‘cheat’ scores!
Performance: ‘Social’ Model
   IMPORTANT: These are ‘cheat’ scores!
Don’t get burned.
   - Biased samples
   - Accuracy and rare events
Document Classification using the Python Natural Language Toolkit


Published in: Technology, Business
  • Thanks for coming along. I am Ben Healey and this talk will be about using the python-based Natural Language Toolkit to automate document classification. My background is in market research and analytics, so my day job primarily involves coding in SAS, working with databases and excel to extract business insights. I also advise on survey design and development. I am relatively new to Python and have recently had reason to use the Natural Language Toolkit to help me with some document classification I need to do. So, when the Kiwi Pycon call for papers came around I thought this would be a good opportunity to learn some more about this process and share my experience with others.
  • My aim today is to cover the overall process involved in developing a document classification algorithm using the NLTK. You’ll come away with an understanding of where to start if you want to do something similar yourself. I’ll also introduce some terms specific to Machine Learning and the NLTK. In particular, I’ll cover:
    - The need for automated classification.
    - Reasons to use the Natural Language Toolkit, and the alternatives available.
    - Machine Learning tools in the NLTK, and the Naïve Bayes classifier in particular.
    - Using python for data preparation.
    - Training and assessing your classifier.
    - Tricks and traps.
    - And, finally, further resources to explore.
  • The need for automated approaches to common document classification tasks is self-evident. Individuals and businesses are dealing with an increasing volume of information in a variety of formats, be it text, audio, image or video. You may think of web-based content generators and social media as being key contributors to this deluge, but organisations are also collecting more complex data internally from customers, employees and other stakeholders that they need to make sense of.
    Unfortunately, people don’t scale. We are comparatively expensive and get tired or bored easily. We are also quite subjective and vary from day to day in the way we apply classifications. So there are potential financial, consistency and time gains to be made from automating repetitive document classification tasks.
    Some common applications include:
    - Spam filtering.
    - Document language identification, for subsequent translation. For instance, ‘Google translate’ uses this to automatically detect the language of a text snippet you want to translate into English.
    - Sentiment analysis. For instance, an organisation may analyse the tweets about its new product to track whether the comments are generally positive, negative, or neutral over time.
    - Google Adsense and other contextual banner ad networks, which choose the banner ads to place on a webpage by first classifying the topic of the text on that page.
    One application that is not common, but which we’ll examine today, is document classification to aid the legal discovery process. I managed to find one organisation called Blackstone Discovery that is doing this sort of thing commercially. So, while the example I’ll use today is contrived, it does reflect a real-world problem.
  • Before getting into the example, it is worth considering the process and tools we’ll be using along the way. I’ve chosen the Natural Language Toolkit for this talk, but there is a range of other tools that can be used with python to build automatic classifiers. Examples include:
    - Sci-Kits, which has a number of machine learning modules in its ‘learn’ package.
    - Orange, which is a data visualization and analysis toolkit written in python. It even has a visual programming interface.
    - The ever popular R language, which can be accessed via a Python interface called RPy.
    For a good discussion of these alternatives and others, do a search on StackOverflow for the phrase ‘machine learning using python’.
    The reason I’ve chosen the Natural Language Toolkit for this talk is that it has a particular focus on document analysis. As such, it contains some very useful tools for manipulating text that don’t come as standard in some of the other python-based alternatives. Since it also contains a number of fairly standard machine learning classifiers, it is a great tool to get started with automated document classification.
  • This diagram outlines the process for building an automatic classifier using supervised machine learning. The bottom line presents the end-goal, which is to be able to:
    - Take a document that we do not have a classification for,
    - Extract some features from that document, and
    - Run those features through our trained classifier, so that
    - We can get a predicted (and accurate) classification for the document.
    In order to achieve this, the classifier has to be developed (trained) using a set of documents for which the classification is known. Ideally, the development set will contain a large number of documents that are representative of the kind of documents you’ll want to classify automatically. For each document in the development set you’ll need to extract a range of features to feed into the classifier during the training stage. Features could include things like the number of words in the document, whether or not a specific word (unigram) or group of words (ngram) appears in the document, who the sender of the document was, the percentage of the document written in ALLCAPS, and so on.
    Different machine learning algorithms can be used, and some are more appropriate for document classification tasks than others. But we’ll talk about that some more later. At this stage, the key things to take away from the diagram are that the quality of your development set, your choice of features to extract, and the machine learning algorithm you select will all have an impact on the reliability of your trained classifier.
  • The modules listed here come bundled with the NLTK. They can be used in the feature extraction and machine learning phases of the document classification process. These are just a sampling of what is available, so you should have a look through the NLTK documentation to identify other components that could fit your use-case.
    Looking first at feature extraction:
    - The corpus module contains ‘words’ and ‘stopwords’. ‘words’ is a list of English words, which can be useful if you want to check whether your documents contain a high proportion of common English words. You might do this if you are building a classifier to tell if the document is English or not. You could also use it to help detect if a document contained high levels of ‘jargon’ terms. ‘stopwords’ is a list of high frequency words such as ‘the’, ‘to’ and ‘also’ that you often want to filter out of the set of words in a document that make it through to the feature list.
    - The stem module contains tools for normalising the words in a document to their common root. For instance, the words ‘programmer’, ‘programs’, ‘programming’ and ‘program’ would all be stemmed to ‘program’ by the Lancaster stemmer. Stemmers are useful because they help your classification algorithm use the root of the word as a signal to pick up on, rather than using different variants of the word as different signals.
    - The tokenize module provides tools to split your text down from a collection of words and symbols into discrete units. For instance, the WordPunctTokenizer breaks a text into words or sub-words based on the presence of punctuation and whitespace. Tokenizing is the common way to get a list of unigrams from your document that can then be included as features to be passed to your classifier for training.
    You can try out different stemmers, tokenizers and other NLTK tools on the streamhacker blog by Jacob Perkins.
    The toolkit also contains tools for finding and selecting collocations, or words that commonly appear together. For instance, the Bigram tools help identify and select two-word combinations that appear throughout a text. There is also a set related to Trigrams (three word combinations). The presence (or not) of these word combinations can also be included as features to be passed in to your classifier for training.
    Turning to machine learning algorithms, the Natural Language Toolkit contains a handful of commonly used classifiers including a Naïve Bayesian classifier, a Decision Tree classifier and a Multinomial Logit classifier (called Maxent). It also provides an interface to the Java-based ‘Weka’ collection of open source machine learning algorithms from the University of Waikato. It is beyond the scope of this talk, and my knowledge, to go into the details of each classifier. However, each classifier has its own profile of computational requirements, flexibility and interpretability. So, it is worth investigating the most appropriate classifier for your purpose and also testing different classifiers to see which produce the best outcomes for your datasets.
    In addition to the classification algorithms themselves, NLTK provides some tools for assessing the performance of your trained classifier under the ‘classify.util’ module. The ‘accuracy’ test is just one of these, and we’ll look at some alternative ways to assess your classifier when we run through the example.
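A minimal sketch of how the feature-extraction tools described above can fit together, assuming NLTK is installed. The sample text and the small hand-picked stopword set are made up for illustration (the real `stopwords` corpus needs a separate download), and this is not the code from the talk.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer

# Hypothetical text, loosely modelled on an email subject line.
text = "Re: Agenda for the FERC meeting. Programs and programming notes attached."

# Hand-picked stand-in for nltk.corpus.stopwords (an assumption, to avoid a download).
STOPWORDS = {"re", "for", "the", "and"}

tokenizer = WordPunctTokenizer()
stemmer = PorterStemmer()

# Split on whitespace and punctuation, lowercase, then keep alphabetic non-stopwords.
tokens = [t.lower() for t in tokenizer.tokenize(text)]
words = [t for t in tokens if t.isalpha() and t not in STOPWORDS]

# Reduce each remaining word to its stem.
stems = [stemmer.stem(w) for w in words]

# Rank adjacent word pairs by chi-squared association and keep the top three.
finder = BigramCollocationFinder.from_words(stems)
top_bigrams = finder.nbest(BigramAssocMeasures.chi_sq, 3)

print(stems)
print(top_bigrams)
```

The stems and bigrams produced this way could then be turned into features, for example one Boolean feature per stem or per selected bigram.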
  • I used a Naïve Bayesian classifier for today’s example. In essence, this classifier generates an estimate of the likelihood that a document is of the class of interest (or label, in the above formulae) by looking at the features that it has and computing the likelihood that a document with those features would be of the class of interest. The likelihood for each feature and label is determined during the training phase, by iterating through the development set of documents for which both the features and label are known. Thus, the classifier extrapolates the probabilities it establishes during training to new documents it encounters.
    The classifier is Bayesian because it employs Bayes' theorem to build up conditional probability distributions for the labels and features. It is considered naïve because it assumes that the occurrence of one feature is unrelated to the occurrence of another feature. That is, it assumes the features are independent. This assumption is frequently incorrect. Nevertheless, Bayesian classifiers have proven to perform well despite this apparent flaw in logic!
    Those interested in learning more about the mathematics underlying the classifier should head over to the related Wikipedia entries for ‘naïve Bayesian classifier’ and ‘Bayesian spam filtering’, which contain some worked examples.
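To make the formula concrete, here is a small pure-Python sketch of the calculation. The training data and feature names are hypothetical. The add-one smoothing is an assumption chosen to keep the arithmetic easy to follow, and NLTK's own classifier may smooth its estimates differently.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (Boolean features, label) pairs.
train = [
    ({"has_txt": True, "short": True}, "IT"),
    ({"has_txt": True, "short": True}, "IT"),
    ({"has_txt": False, "short": True}, "IT"),
    ({"has_txt": False, "short": False}, "other"),
    ({"has_txt": False, "short": False}, "other"),
    ({"has_txt": False, "short": True}, "other"),
]

label_counts = Counter(label for _, label in train)

# feature_counts[label][(name, value)] = number of training documents
# with that label whose feature `name` took the value `value`.
feature_counts = defaultdict(Counter)
for features, label in train:
    for name, value in features.items():
        feature_counts[label][(name, value)] += 1


def posterior(features):
    """Return P(label | features) for every label, using Bayes' theorem."""
    scores = {}
    for label, n_label in label_counts.items():
        score = n_label / len(train)  # P(label)
        for name, value in features.items():
            # P(f_i | label), add-one smoothed over the two Boolean values.
            score *= (feature_counts[label][(name, value)] + 1) / (n_label + 2)
        scores[label] = score
    total = sum(scores.values())  # P(features), summed over all labels
    return {label: score / total for label, score in scores.items()}


probs = posterior({"has_txt": True, "short": True})
print(probs)
```

Multiplying the per-feature probabilities together is exactly where the ‘naïve’ independence assumption enters.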
  • Now that we have covered the process and components necessary to build a document classifier using the NLTK, I’d like to jump into an example. For those of you who haven’t heard about Enron, it was a very large US energy and commodities trading company which went spectacularly bankrupt in late 2001. The ripples of its bankruptcy were felt throughout the American financial and political system. Subsequent investigations surrounding its accounting practices led to the arrest and imprisonment of a number of its key executives, along with the downfall of one of the world’s largest accounting firms, Arthur Andersen.
    As part of the investigations that went on, Enron was required to provide prosecutors with the organisation’s emails for a period prior to the collapse. Those emails subsequently made it into the public domain, and I thought they’d be an interesting set of documents to build some classifiers on for this talk.
  • In total, just over half a million emails were released as part of the Enron document discovery process. That’s a lot of documents to try and look through if you are a prosecutor attempting to identify key documents for your case. It is likely to be prohibitively expensive to hire a group of people (even interns!) to look through all of the documents. It would probably also be very difficult to build a classifier to try and pick out ‘key documents’ since, by definition, these are rare. If you think about these key documents as needles in a haystack, you would spend an inordinate amount of time trying to find examples to feed into your development set. By the time you’d found them, you would have probably looked through half of the emails already!
    Of course, if you had some keywords you knew would exist in the documents, or you knew who the key players were in the case, you could search for key terms or only read the emails from certain individuals. But if these avenues were exhausted, another approach you could take is to try and reduce the size of the haystack.
    Specifically, if you could identify the types of documents that are very unlikely to contain information you are interested in, and then have a classifier run over the entire set to identify documents that are likely to be of those types, you could reduce your workload significantly. This would allow you to focus your efforts on documents that weren’t clearly ‘hay’.
  • Although the full set of 517 thousand messages was available as an archive, it had not had much in the way of elementary data cleaning done to it. So, instead I have used a subset of just over 20 thousand messages that have already gone through some cleaning. There is a link to the download at the end of the talk for those interested. One point to note is that the subset comes from seven Enron employees who had a particularly large number of emails in their folders. So, the subset is not likely to be representative of the entire message set and, as a result, any classifiers built on them could not be expected to perform well over the entire set.
    The messages came as separate plaintext files within a folder structure reflecting the owner of each message and the email subfolder the owner had filed it in. Although I have not used these elements of structure as features in developing the classifiers here, this sort of information would be relevant to the classification task and might be included in a more comprehensive modelling project.
    I used the MySQL ODBC interface for python, along with functions from the standard os and string python modules, to import each plaintext file, separate out some key fields such as the sender, recipients, subject line, and message body, and finally insert a row into a dedicated MySQL table.
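As a rough illustration of the field-separation step, the sketch below uses Python's standard `email` module rather than the `os` and `string` functions mentioned above. It assumes each plaintext file looks like a standard header-plus-body email message. The sample message, the `extract_fields` helper, and the `msg_from` and `msg_to` field names are hypothetical.

```python
from email import message_from_string

# Hypothetical raw message, assumed to follow the usual headers-then-body layout.
raw = """From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Re: Agenda for FERC Meeting

Lou, the agenda is attached.
"""


def extract_fields(raw_text):
    """Split one plaintext message into the key fields used for modelling."""
    msg = message_from_string(raw_text)
    to_header = msg.get("To") or ""
    recipients = [addr.strip() for addr in to_header.split(",") if addr.strip()]
    return {
        "msg_from": msg.get("From", ""),
        "msg_to": recipients,
        "msg_subject": msg.get("Subject", ""),
        "msg_body": msg.get_payload(),  # plain string for non-multipart messages
    }


record = extract_fields(raw)
print(record["msg_subject"], len(record["msg_to"]))
```

Each resulting dictionary corresponds to one row that could be inserted into a database table.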
  • As part of the extraction process I also allocated a random number to each message and created some derived fields such as the number of people the message was sent to, the number of words in the body, and the percentage of words that were in all-caps.
    The random number was used to select a sub-sample from the full table to use as a development set, while the derived fields were created as features to be fed into the Bayesian classifier.
    As an aside, I’ve found that it is generally a good idea to allocate a random number to each record when you are creating a dataset. Most databases, be they relational or NoSQL, don’t handle random selection tasks very well, so it can save time to have a way of randomly ordering records pre-baked in to your data.
    Those interested in the details can download the code; just do a search for KiwiPycon to find the files.
    It is worth noting that more cleaning could have been done to the messages. For instance, a number of the messages are forwards or replies. So, they contain two or more messages in one, along with the original message metadata. This sort of duplication and noise might confuse a learning algorithm. Alternatively, it may help the algorithm pick up on key signals about the message context. Whatever the case, an attempt would ideally be made to train the classifier with and without the presence of these confounds to determine the effect on model accuracy.
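A minimal sketch of the ‘pre-baked random number’ idea, using an in-memory SQLite table as a stand-in for the MySQL table. The table layout, the `pct_caps` and `rand_key` column names, and the sample bodies are assumptions for illustration only.

```python
import random
import sqlite3

random.seed(42)  # fixed seed so the sketch is repeatable

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL table
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, msg_body TEXT, "
    "num_words_in_body INTEGER, pct_caps REAL, rand_key REAL)"
)


def derived_fields(body):
    """Return (word count, percentage of words written entirely in capitals)."""
    words = body.split()
    caps = sum(1 for w in words if w.isalpha() and w.isupper())
    pct_caps = 100.0 * caps / len(words) if words else 0.0
    return len(words), pct_caps


# Hypothetical message bodies.
bodies = ["START DATE 1/11/02 hour 5", "Lou, the agenda is attached."] * 50
for body in bodies:
    n_words, pct_caps = derived_fields(body)
    conn.execute(
        "INSERT INTO messages (msg_body, num_words_in_body, pct_caps, rand_key) "
        "VALUES (?, ?, ?, ?)",
        (body, n_words, pct_caps, random.random()),
    )

# A repeatable random sample: order by the stored key and take the first n rows.
query = "SELECT id FROM messages ORDER BY rand_key LIMIT 10"
sample = conn.execute(query).fetchall()
print(sample)
```

Because the random key is stored with each row, running the same query again returns the same sample, which makes a development set easy to reproduce.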
  • Here is one of the messages from the set. I classified this as a legal/regulatory message, but it is possible that someone else might classify it as an administrative message since it pertains to a meeting agenda. The key point is that any ambiguity inherent in the message context will likely flow through into ambiguity in the training phase and degrade the performance of the classifier.
  • This message is a little more clear cut. I’ve classified it as an IT message, since it pertains to the output from an automated process. At a glance we can see that messages such as this are likely to contain signals that should be easily picked up by a classifier. For instance, the message is short, contains a number of all caps words, and is likely to contain symbols such as ‘-->>’ that would not normally appear in ‘human’ messages. Look at the message a few seconds more and you will probably come up with a handful of other signals that might help a classifier distinguish this sort of message from another. This is a key way to come up with features to feed into your classifier. So, having a human parse some of the documents is a key part of the idea generation and model training phase.
  • Another aspect that may require human involvement is in assigning a class to the development set. For this example, I went through and tagged a random sample of 1000 messages with one or more classes. Don’t try this at home. It is quite boring. The more sensible approach, if feasible, would be to outsource the tagging to a crowdsourcing service such as Amazon Mechanical Turk or CrowdFlower. I did actually attempt to put my sample through CrowdFlower, but came up against a number of formatting issues with the messages. These, and the looming conference deadline, led me to decide to tag them myself using a hastily created MSAccess form instead!
    If you are particularly lucky you may have a development set that is already tagged, or that is trivial to tag automatically. For instance, if you were building a classifier to predict which stocks would rise in value in future, you would be able to tag an historic training set very easily using historic stock price information.
    A couple of points to note from this chart are:
    - My intent here is to build one classifier per class. For instance, I’ll build one classifier that predicts whether a given message is related to External Relations, or not. Another classifier will predict whether a given message is related to Info Tech or not, and so on. Just be aware that there are classifiers that will allow you to predict across multiple classes. Indeed, you could even combine the results from the separate models in this example to get one ‘overall predicted class’ for each message. However, this is beyond the scope of this talk and my knowledge at this point!
    - You’ll notice that the ‘External Relations’ and ‘Social/Personal’ classes don’t have many messages in them. This is likely to hinder development of an accurate classifier for those classes, since the classifier will not have many examples to train on. In contrast, there is a better chance we will be able to build an accurate classifier for the IT, Regulatory, and Deals classes.
  • With the development set of 1000 randomly selected messages in place, the next step is to pull out a range of features for the classifier to train on. I’ve taken a kitchen sink, ‘one size fits all’ approach for the models in this talk, but with more time I’d select features for each model in an iterative process. For instance, you’d throw a bunch of features into the first model (say, for IT), check out its accuracy, then whittle down or modify the features and see what effect that has on accuracy. What you are aiming for is a model that is both robust (i.e., it gives good predictions against a test set of messages) and parsimonious (i.e., it uses the smallest number of features necessary to achieve its accuracy). Simply throwing the kitchen sink at your model, as I have done here, increases the chance of overfitting, which is where the model performs well on the training set but poorly when you attempt to apply it to documents that it was not trained on.
    This slide lists the core NLTK modules I used to expand out the feature set for my models. The code relating to this is also available for download.
  • One point worth noting about the feature development stage is that you’ll generally be aiming to generate features from your documents that are either Boolean or nominal (aka categorical). By nominal I mean a short list of possible categories, not necessarily with any inherent order. The example on this slide shows a set of categories for the ‘num_words_in_body’ feature which do have an inherent order (‘long’ messages are larger than ‘medium’ messages), but a Bayesian classifier is ignorant of this. As far as it is concerned, you might as well be giving it a list of colors. All it cares about is that you are giving it a feature which casts a message into a specific, mutually exclusive group based on this thing you’ve called ‘num_words_in_body’.
  • A Boolean feature is formatted as its name suggests. The token-based features (unigrams, bigrams etc.) extracted using some of the NLTK tools are generally presented in Boolean format to classifiers for training; either the feature exists for the document, or it doesn’t.
    Note that you could also feed numeric features into the classifier. However, you need to take care to avoid a situation where the feature can take on so many different values that there are not enough instances of each value in the development set for the classifier to build up a good history of that value’s co-occurrence with each document class. This is why I created the ‘num_words_in_body’ feature as a series of bins rather than leaving it as a discrete numeric variable.
    One final thing to note is that all of the features I’ve extracted are ones that I will be able to extract for any new, unclassified, documents I want to feed into the classifier. If they weren’t, I couldn’t expect any resulting trained classifier to accurately predict the class for new documents.
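The two feature forms can be sketched as follows. The 20-word cut-off and the `record` field names come from the slides. The 80-word cut-off, the ‘short’ bin name, the `extract_features` helper, the vocabulary list, and the regex tokenizer (a plain stand-in for an NLTK tokenizer) are assumptions.

```python
import re


def extract_features(record, vocabulary):
    """Build a dict of nominal and Boolean features for one message."""
    features = {}

    # Nominal feature: bin the word count rather than passing the raw number.
    n_words = record["num_words_in_body"]
    if n_words <= 20:
        features["num_words_in_body"] = "short"
    elif n_words <= 80:  # hypothetical second threshold
        features["num_words_in_body"] = "medium"
    else:
        features["num_words_in_body"] = "long"

    # Boolean features: does each vocabulary word appear in the subject or body?
    text = record["msg_subject"] + " " + record["msg_body"]
    tokens = set(re.findall(r"\w+", text.lower()))
    for word in vocabulary:
        features["contains(%s)" % word] = word in tokens

    return features


# Hypothetical record and vocabulary.
record = {
    "num_words_in_body": 12,
    "msg_subject": "Start Date: 1/11/02",
    "msg_body": "See attached schedule.txt output",
}
vocabulary = ["txt", "agenda"]
features = extract_features(record, vocabulary)
print(features)
```

Every feature here can be computed from a new, unclassified message, which is the property the final point above relies on.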
  • Much of the grunt work has now been done. The data is in a reasonable state for modelling, a range of features has been extracted, and the development set has at least one class assigned to each document. The next step is to train a classifier using the data and assess its accuracy. NLTK makes this relatively simple. As the slide shows, the development set is split into two chunks at random. The first chunk, the training set, is used to train the classifier. The second chunk, the test set, is used to test the model’s accuracy. This second chunk is sometimes also referred to as a hold-out sample, and acts as a tool for determining how the classifier might perform ‘in practice’ if we were to ask it to predict the classes for documents that it had not seen during its training.
    You can test a model more rigorously than shown in this slide, by performing multiple rounds of cross-validation. But this is enough to provide a good overview of the process.
    You’ll see here that I have used two of NLTK’s tools for assessing accuracy: the ‘accuracy’ function and the ‘show_most_informative_features’ method of the classifier itself. Again, I’ve used these for illustrative purposes but there are other testing metrics that could also be applied to assess the accuracy of the model.
    Again, my code for this is available for download.
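A minimal sketch of the split, train and assess steps, assuming NLTK is installed. The development set here is a toy, perfectly separable one, so the accuracy it reports says nothing about real email. Note the `//`: on Python 3, `len(dev_set)*2/3` yields a float, which cannot be used as a slice index.

```python
import random

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

random.seed(0)  # fixed seed so the shuffle is repeatable

# Hypothetical development set: (feature dict, label) pairs.
it_docs = [({"contains(txt)": True, "num_words_in_body": "short"}, "IT")] * 30
other_docs = [({"contains(txt)": False, "num_words_in_body": "long"}, "other")] * 30
dev_set = it_docs + other_docs

# Two thirds for training, one third held out for testing.
random.shuffle(dev_set)
cutoff = len(dev_set) * 2 // 3  # integer division keeps the slice index an int
train_set = dev_set[:cutoff]
test_set = dev_set[cutoff:]

classifier = NaiveBayesClassifier.train(train_set)
acc = accuracy(classifier, test_set)

print("Accuracy on the hold-out set: %.2f" % acc)
classifier.show_most_informative_features(5)
```

A separate classifier of this kind would be trained for each class of interest, with the label set to the class name or ‘other’.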
  • Here are those two performance metrics for the ‘IT’ classifier. The accuracy figure of 0.93 indicates that the classifier allocated the correct class (IT or not IT) to 93% of the test cases. This is a promising sign, and not unexpected given the general structure of these types of messages.
    The ‘most informative features’ list outlines the features that are most useful to the model in helping it distinguish ‘IT’ messages from other messages, along with the likelihood ratios for those features. It shows, for instance, that messages containing the word (or token) ‘txt’ are IT-related about 150 times more often than they are non IT-related. The features listed make sense given the IT context, so we can gain some confidence that the model is working OK.
  • This slide shows the same metrics for a classifier built to distinguish ‘deal or trading’ messages from other messages. Again, the results are generally promising. Accuracy is not as good as for the IT model, but it is still reasonable, and the most informative features make sense. The terms ‘nom’ and ‘nomin’ are common trading terms, while ‘mmbtu’ and ‘txu’ relate to volume units.
    The likelihood ratios are not as high as some of those identified for the IT model, so this model may have more difficulty distinguishing between ‘deal’ and ‘non-deal’ message types.
  • And here are the metrics for a poorly performing model: ‘social/personal’.
    The accuracy figure of 0.48 suggests this classifier is having quite a bit of difficulty distinguishing ‘social’ messages from others. If you cast your mind back to the chart I presented earlier you’ll remember that very few messages were tagged as ‘social’ in the training set, so the classifier didn’t have much to go on. Given that, this result is not surprising. More work would need to be done to get more training cases for this class or to tease out some more features that might give the classifier a better chance of predicting correctly.
    One positive is that the informative features seem to make sense given the context. Most of the terms are those you would expect to see more often in personal or social messages, and less often in other types of messages.
  • Although the accuracy metric presented earlier can be a good indicator of classifier performance, it is dependent on the rate of incidence of a class in the test set. For example, imagine a classifier for the class “hen’s tooth”, where only 1 out of 100 test cases is actually a hen’s tooth. If the “hen’s tooth” classifier was really bad, so that all it did was classify each document as “not a hen’s tooth”, then it would classify 99 out of the 100 test cases correctly. That is, even a very bad model can theoretically achieve an accuracy score of 99%.
    It is therefore wise to consider a variety of performance metrics when assessing a classifier. This slide presents two alternative ways of looking at model performance that I find useful. They rely on the fact that the NLTK classifiers produce both a predicted class for a given document and the probability score associated with the predicted class. You can take the probability scores for all documents in your test set, order them from lowest to highest, and assign each document to a decile (9 being the 10% of documents with the highest probability of being your class of interest). The table and chart above are based on these deciles, and show what the actual classes were for the documents in each decile.
    Note, I’ve actually produced these using my full development set of 1000 cases, rather than just the held-out test set of around 333. You shouldn’t really do this, but I wanted to use all of my cases to give a little stability to the figures produced. So, bear in mind that these are ‘cheat’ scores for the models.
    The percentages in the table should be read along each row, and give an indication of where the model is getting confused. Percentages less than 10% have been suppressed. For the IT model there is very little confusion: 95% of the cases in the top decile were actually IT messages, and 50% of the cases in decile 8 were IT messages. For all other deciles the percentage of cases that were IT was less than 10%.
    The chart presents information only for ‘hits’; that is, those cases that were classed as ‘IT’ (167 out of 1000). It shows that around 55% of those cases appeared in the top decile, and that around 85% of the cases appeared in the top 2 deciles. As a comparison, if the classifier was simply allocating probabilities at random, you’d expect the ‘hits’ to be evenly distributed across the deciles, so around 10% of cases would be in the top decile, 20% would be in the top 2 deciles, and so on. This is what the dashed blue line represents.
    For an ideal classifier you would see all of the ‘hits’ appearing in the top deciles, with the cumulative line going to 100% very quickly. Overall, then, these results indicate the classifier performs very well at classifying IT messages.
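Both ideas can be sketched in plain Python. The first part reproduces the “hen’s tooth” arithmetic. The second ranks documents by score and reports the cumulative share of ‘hits’ captured by the top bins, highest-scoring bin first. The scores and labels are made up; with an NLTK classifier the scores would come from something like `classifier.prob_classify(features).prob(label)`.

```python
def accuracy(predicted, actual):
    """Fraction of cases where the predicted class matches the actual class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# The "hen's tooth" case: 1 positive in 100, and a model that always says no.
actual = [True] + [False] * 99
always_no = [False] * 100
rare_event_accuracy = accuracy(always_no, actual)
print(rare_event_accuracy)


def cumulative_gains(scores, labels, n_bins=10):
    """Share of all hits found in the top 1, 2, ..., n_bins bins by score."""
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ranked = [label for _, label in pairs]
    total_hits = sum(ranked)
    bin_size = len(ranked) // n_bins
    gains = []
    for b in range(1, n_bins + 1):
        # The last bin absorbs any remainder so every document is counted.
        end = len(ranked) if b == n_bins else b * bin_size
        gains.append(sum(ranked[:end]) / total_hits)
    return gains


# Hypothetical scores for 20 documents; the four highest-scoring are true hits.
scores = [i / 20 for i in range(20)]
labels = [i >= 16 for i in range(20)]
gains = cumulative_gains(scores, labels)
print(gains)
```

A classifier that scored documents at random would be expected to give gains close to 0.1, 0.2, 0.3 and so on, which is the baseline the dashed line in the chart represents.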
  • Looking at the charts for the ‘deal’ classifier, it appears that this is also performing well, although perhaps not as well as the IT classifier. The table suggests that the classifier is having some difficulty distinguishing ‘deal’ messages from ‘other’ and ‘legal’ messages, so some more work could be done to tease out features that might help it there.
  • Finally, the ‘social’ classifier is performing about as badly as we might have expected from early indications!
    The table suggests that the model is having difficulty distinguishing ‘social’ messages from other messages across most types. This is reinforced by the information in the chart, which has the model performing only slightly better than what we might expect by chance.
    At this point, if we cast our focus back to the prosecutor in the Enron case, we can start to see a way forward with automated document classification. Even after this ‘beta’ run-through, the prosecutor could be confident that a classifier would help them eliminate IT messages (estimated to be around 17% of all messages in the sample) from the haystack. Similarly, if some deal messages were likely to contain information of interest to the case, there is now a classifier that can be used to identify those messages from the haystack that should be prioritised for human review. And, as those messages were reviewed, more information would become available that could then be fed into further development of the classifiers.
  • So there you have it. We’ve covered the process for developing automated document classifiers using the NLTK and python, with the Enron emails being an illustrative example. Hopefully you’ll be able to see how this sort of modelling could be useful to you professionally or personally.
    I’ve mentioned a few ‘gotchas’ along the way, but it is worth reiterating them before I finish up, since they can have a big impact on the success of your classifiers.
    - Try to avoid using a biased sample to train your classifiers. If your development set doesn’t contain the same general mix of documents as those you expect to apply your classifier to, you can’t really expect your classifier to perform well in practice.
    - Be careful of the ‘accuracy’ metric, particularly if you are building a classifier for a rare document type. There are methods for dealing with modelling situations for rare events which I’ve not covered here, and it pays to take a look at a few different performance metrics for your model before you finalise and rely on it.
    - Your prior knowledge, or that of subject matter experts, can be very useful for identifying key features to give your classifier for training. Ultimately, you are trying to get the classifier to pick up on knowledge that you probably already have as a human, so that you can apply that knowledge at scale. So, you need to think about what it is that enables you as a human to distinguish one document type from another and formalise that as much as possible so that it can be used in an algorithm.
    - Don’t expect the first classifier you build to be the best one. Like most things in life, there is a learning cycle involved here. You’ll probably go through a number of ‘draft’ classifiers before you settle on a final classifier that is reliable and parsimonious. Even then, once the classifier is in production it will need to be tweaked as new knowledge about the domain becomes available.
  • Those interested in trying their hand at automated classification will find the following resources useful. The first link is to the NLTK website. There, you’ll also find a link to a book covering the toolkit, natural language processing, and document classification. The second link is to a downloadable archive of the Enron emails. Those interested in learning more about machine learning should also consider taking the free online course being offered through Stanford in October to November 2011. You can sign up at ml-class.com. Finally, Jacob Perkins runs a blog about Python, natural language processing and machine learning which contains a number of very useful examples, demos, and articles.
  • Document Classification using the Python Natural Language Toolkit
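The warning in the notes about the ‘accuracy’ metric can be made concrete with a small sketch. The numbers below are made up for illustration (they are not results from the Enron models), and the helper name `summarise` is hypothetical. It assumes a rare class that makes up 2% of messages and a lazy classifier that never predicts it.

```python
def summarise(actual, predicted, positive):
    """Return accuracy, precision and recall for one class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when the class is never predicted
        # or never present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical test set: 20 rare 'social' messages among 1,000.
actual = ["social"] * 20 + ["other"] * 980

# A classifier that always answers 'other', whatever the message says.
always_other = ["other"] * 1000

scores = summarise(actual, always_other, positive="social")
print(scores)
```

Here the do-nothing classifier scores 98% accuracy while finding none of the rare messages, which is why precision and recall are worth checking alongside accuracy.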

    1. Document Classification using the Natural Language Toolkit
        Ben Healey
        @BenHealey
    2. Source: IStockPhoto
    3. The Need for Automation
    4. Take ur pick!
    5. Features:
        - # Words
        - % ALLCAPS
        - Unigrams
        - Sender
        - And so on.
        Class:
        The Development Set
        Diagram labels: Classification Algo.; Trained Classifier (Model); New Document (Class Unknown); Document Features; Classified Document.
    6. Relevant NLTK Modules
        Feature Extraction
        from nltk.corpus import words, stopwords
        from nltk.stem import PorterStemmer
        from nltk.tokenize import WordPunctTokenizer
        from nltk.collocations import BigramCollocationFinder
        from nltk.metrics import BigramAssocMeasures
        See for examples
        Machine Learning Algos and Tools
        from nltk.classify import NaiveBayesClassifier
        from nltk.classify import DecisionTreeClassifier
        from nltk.classify import MaxentClassifier
        from nltk.classify import WekaClassifier
        from nltk.classify.util import accuracy
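    7. NaiveBayesClassifier
        $P(\text{label} \mid \text{features}) = \dfrac{P(\text{label}) \, P(\text{features} \mid \text{label})}{P(\text{features})}$
        $P(\text{label} \mid \text{features}) = \dfrac{P(\text{label}) \, P(f_1 \mid \text{label}) \cdots P(f_n \mid \text{label})}{P(\text{features})}$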
    8.
    9. 517,431 Emails
        Source: IStockPhoto
    10. Prep: Extract and Load
        Sample* of 20,581 plaintext files
        import MySQLdb, os, random, string
        MySQL via Python ODBC interface
        File, string manipulation
        Key fields separated out
        To, From, CC, Subject, Body
        * Folders for 7 users with a large number of emails. So not representative!
    11. Prep: Extract and Load
        Allocation of random number
        Some feature extraction
        #To, #CCd, #Words, %digits, %CAPS
        Note: more cleaning could be done
        Code at
    12. From:
        To:
        Subject: Re: Agenda for FERC Meeting RE: EOL
        Louise --
        We had decided that not having Mark in the room gave us the ability to wiggle if questions on CFTC vs. FERC regulation arose. As you can imagine, FERC is starting to grapple with the issue that financial trades in energy commodities is regulated under the CEA, not the Federal Power Act or the Natural Gas Act.
        Thanks,
        Jim
    13. From:
        To:
        Subject: Start Date: 1/11/02; HourAhead hour: 5;
        Start Date: 1/11/02; HourAhead hour: 5; No ancillary schedules awarded. No variances detected.
        LOG MESSAGES:
        PARSING FILE -->> O:PortlandWestDeskCalifornia SchedulingISO Final Schedules2002011105.txt
    14. Class[es] assigned for 1,000 randomly selected messages:
    15. Prep: Show us ur Features
        NLTK toolset
        from nltk.corpus import words, stopwords
        from nltk.stem import PorterStemmer
        from nltk.tokenize import WordPunctTokenizer
        from nltk.collocations import BigramCollocationFinder
        from nltk.metrics import BigramAssocMeasures
        Custom code
        def extract_features(record,stemmer,stopset,tokenizer):
            …
        Code at
    16. Prep: Show us ur Features
        Features in boolean or nominal form
        if record['num_words_in_body']<=20:
            features['message_length']='Very Short'
        elif record['num_words_in_body']<=80:
            features['message_length']='Short'
        elif record['num_words_in_body']<=300:
            features['message_length']='Medium'
        else:
            features['message_length']='Long'
    17. Prep: Show us ur Features
        Features in boolean or nominal form
        text=record['msg_subject']+" "+record['msg_body']
        tokens = tokenizer.tokenize(text)
        words = [stemmer.stem(x.lower()) for x in tokens if x not in stopset and len(x) > 1]
        for word in words:
            features[word]=True
    18. Sit. Say. Heel.
        # Shuffle, then train on the first two thirds and test on the remainder
        random.shuffle(dev_set)
        cutoff = len(dev_set)*2//3  # integer division, so cutoff is a valid slice index
        train_set=dev_set[:cutoff]
        test_set=dev_set[cutoff:]
        classifier = NaiveBayesClassifier.train(train_set)
        print('accuracy for > ', subject, ':', accuracy(classifier, test_set))
        classifier.show_most_informative_features(10)
    19. Most Important Features
    20. Most Important Features
    21. Most Important Features
    22. Performance: ‘IT’ Model
        IMPORTANT: These are ‘cheat’ scores!
    23. Performance: ‘Deal’ Model
        IMPORTANT: These are ‘cheat’ scores!
    24. Performance: ‘Social’ Model
        IMPORTANT: These are ‘cheat’ scores!
    25. Don’t get burned.
        - Biased samples
        - Accuracy and rare events
        - Features and prior knowledge
        - Good modelling is iterative!
        - Resampling and robustness
        - Learning cycles
    26. Resources
        NLTK:
        Enron email datasets:
        Free online Machine Learning course from Stanford (starts in October)
        StreamHacker blog by Jacob Perkins
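
To connect the Naive Bayes formula on slide 7 with the boolean word features on slide 17, here is a minimal pure-Python sketch of the same idea. It is not NLTK's implementation: the function names, the add-one smoothing, and the toy messages are all assumptions made for illustration, and only features present in a message contribute to its score. P(features) is dropped because it is the same for every label, so it does not change which label wins.

```python
import math
from collections import Counter, defaultdict


def train(labelled):
    """labelled: list of (set_of_feature_names, label) pairs."""
    label_counts = Counter(label for _, label in labelled)
    feature_counts = defaultdict(Counter)  # label -> feature -> message count
    vocab = set()
    for feats, label in labelled:
        feature_counts[label].update(feats)
        vocab.update(feats)
    return label_counts, feature_counts, vocab


def classify(model, feats):
    """Return the label with the highest log P(label) + sum log P(f|label)."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior, P(label)
        for f in feats:
            if f not in vocab:
                continue  # features never seen in training are ignored
            # Add-one smoothing (an assumption) avoids log(0).
            score += math.log((feature_counts[label][f] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-made development set (not taken from the Enron data).
toy_set = [
    ({"password", "server", "reset"}, "it"),
    ({"server", "outage"}, "it"),
    ({"lunch", "friday"}, "other"),
    ({"meeting", "agenda", "friday"}, "other"),
]

model = train(toy_set)
print(classify(model, {"server", "password"}))
print(classify(model, {"lunch", "friday"}))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once each message contributes hundreds of word features.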