Document Classification using the Natural Language Toolkit Ben Healey http://benhealey.info @BenHealey
Source: iStockphoto
The Need for Automation
Image: http://upload.wikimedia.org/wikipedia/commons/b/b6/FileStack_retouched.jpg
Take ur pick!
Image: http://upload.wikimedia.org/wikipedia/commons/d/d6/Cat_loves_sweets.jpg
The supervised learning pipeline:
The Development Set (documents with a known Class, plus extracted Features) → Classification Algo. → Trained Classifier (Model)
New Document (Class Unknown) → Document Features → Trained Classifier (Model) → Classified Document
Features: # Words, % ALLCAPS, Unigrams, Sender, and so on.
Relevant NLTK Modules

Feature Extraction:
from nltk.corpus import words, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
See http://text-processing.com/demo/ for examples

Machine Learning Algos and Tools:
from nltk.classify import NaiveBayesClassifier
from nltk.classify import DecisionTreeClassifier
from nltk.classify import MaxentClassifier
from nltk.classify import WekaClassifier
from nltk.classify.util import accuracy
NaiveBayesClassifier

P(label|features) = P(label) × P(features|label) / P(features)

With the naive independence assumption over features f1 … fn:

P(label|features) = P(label) × P(f1|label) × … × P(fn|label) / P(features)

http://61.153.44.88/nltk/0.9.5/api/nltk.classify.naivebayes-module.html
Image: http://www.educationnews.org/commentaries/opinions_on_education/91117.html
517,431 Emails
Source: iStockphoto
Prep: Extract and Load
Sample* of 20,581 plaintext files
import MySQLdb, os, random, string
MySQL via the MySQLdb Python interface; file and string manipulation
Key fields separated out: To, From, CC, Subject, Body
* Folders for 7 users with a large number of email, so not representative!
Prep: Extract and Load
Allocation of a random number
Some feature extraction: #To, #CCd, #Words, %digits, %CAPS
Note: more cleaning could be done
Code at benhealey.info
From: james.steffes@enron.com To: louise.kitchen@enron.com Subject: Re: Agenda for FERC Meeting RE: EOL Louise -- We had decided that not having Mark in the room gave us the ability to wiggle if questions on CFTC vs. FERC regulation arose.  As you can imagine, FERC is starting to grapple with the issue that financial trades in energy commodities is regulated under the CEA, not the Federal Power Act or the Natural Gas Act.   Thanks, Jim
From: pete.davis@enron.com To: pete.davis@enron.com Subject: Start Date: 1/11/02; HourAhead hour: 5; Start Date: 1/11/02; HourAhead hour: 5;  No ancillary schedules awarded.  No variances detected.      LOG MESSAGES: PARSING FILE -->> O:\Portland\WestDesk\California Scheduling\ISO Final Schedules\2002011105.txt
Class[es] assigned for 1,000 randomly selected messages:
Prep: Show us ur Features
NLTK toolset:
from nltk.corpus import words, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
Custom code:
def extract_features(record,stemmer,stopset,tokenizer): …
Code at benhealey.info
Prep: Show us ur Features
Features in boolean or nominal form:
if record['num_words_in_body']<=20:
    features['message_length']='Very Short'
elif record['num_words_in_body']<=80:
    features['message_length']='Short'
elif record['num_words_in_body']<=300:
    features['message_length']='Medium'
else:
    features['message_length']='Long'
Prep: Show us ur Features
Features in boolean or nominal form:
text = record['msg_subject']+" "+record['msg_body']
tokens = tokenizer.tokenize(text)
words = [stemmer.stem(x.lower()) for x in tokens
         if x not in stopset and len(x) > 1]
for word in words:
    features[word]=True
Sit. Say. Heel.
random.shuffle(dev_set)
cutoff = len(dev_set)*2/3
train_set = dev_set[:cutoff]
test_set = dev_set[cutoff:]
classifier = NaiveBayesClassifier.train(train_set)
print 'accuracy for > ', subject, ':', accuracy(classifier, test_set)
classifier.show_most_informative_features(10)
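Continuing from the code above: once trained, the classifier can also score new, unlabelled messages. A brief hedged sketch (new_record is a hypothetical unclassified message; 'IT' is one assumed label):

new_feats = extract_features(new_record, stemmer, stopset, tokenizer)
print(classifier.classify(new_feats))                  # hard label, e.g. 'IT' or 'not IT'
print(classifier.prob_classify(new_feats).prob('IT'))  # probability for one label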
Most Important Features (three slides listing the most informative features, one per model)
Performance: ‘IT’ Model IMPORTANT: These are ‘cheat’ scores!
Performance: ‘Deal’ Model IMPORTANT: These are ‘cheat’ scores!
Performance: ‘Social’ Model IMPORTANT: These are ‘cheat’ scores!
Don’t get burned.
Accuracy and rare events

Features and prior knowledge

Good modelling is iterative!

Resources
NLTK: www.nltk.org/ and the NLTK book: http://www.nltk.org/book
Enron email datasets: http://www.cs.umass.edu/~ronb/enron_dataset.html
Free online Machine Learning course from Stanford: http://ml-class.com/ (starts in October)
StreamHacker blog by Jacob Perkins: http://streamhacker.com

Editor's Notes

  1. Thanks for coming along. I am Ben Healey and this talk will be about using the Python-based Natural Language Toolkit to automate document classification. My background is in market research and analytics, so my day job primarily involves coding in SAS, working with databases and Excel to extract business insights. I also advise on survey design and development. I am relatively new to Python and have recently had reason to use the Natural Language Toolkit to help me with some document classification I need to do. So, when the Kiwi PyCon call for papers came around I thought this would be a good opportunity to learn some more about this process and share my experience with others.
  2. My aim today is to cover the overall process involved in developing a document classification algorithm using the NLTK. You’ll come away with an understanding of where to start if you want to do something similar yourself. I’ll also introduce some terms specific to Machine Learning and the NLTK. In particular, I’ll cover:
- The need for automated classification.
- Reasons to use the Natural Language Toolkit, and the alternatives available.
- Machine Learning tools in the NLTK, and the Naïve Bayes classifier in particular.
- Using Python for data preparation.
- Training and assessing your classifier.
- Tricks and traps.
- And, finally, further resources to explore.
  3. The need for automated approaches to common document classification tasks is self-evident. Individuals and businesses are dealing with an increasing volume of information in a variety of formats, be it text, audio, image or video. You may think of web-based content generators and social media as being key contributors to this deluge, but organisations are also collecting more complex data internally from customers, employees and other stakeholders that they need to make sense of.
Unfortunately, people don’t scale. We are comparatively expensive and get tired or bored easily. We are also quite subjective and vary from day to day in the way we apply classifications. So there are potential financial, consistency and time gains to be made from automating repetitive document classification tasks.
Some common applications include:
- Spam filtering.
- Document language identification, for subsequent translation. For instance, Google Translate uses this to automatically detect the language of a text snippet you want to translate into English.
- Sentiment analysis. For instance, an organisation may analyse the tweets about its new product to track whether the comments are generally positive, negative, or neutral over time.
- Google AdSense and other contextual banner ad networks, which choose the banner ads to place on a webpage by first classifying the topic of the text on that page.
One application that is not common, but which we’ll examine today, is document classification to aid the legal discovery process. I managed to find one organisation called Blackstone Discovery that is doing this sort of thing commercially. So, while the example I’ll use today is contrived, it does reflect a real-world problem.
  4. Before getting into the example, it is worth considering the process and tools we’ll be using along the way. I’ve chosen the Natural Language Toolkit for this talk, but there are a range of other tools that can be used with Python to build automatic classifiers. Examples include:
- Sci-Kits, which has a number of machine learning modules in its ‘learn’ package.
- Orange, which is a data visualization and analysis toolkit written in Python. It even has a visual programming interface.
- And the ever-popular R language, which can be accessed via a Python interface called RPy.
For a good discussion of these alternatives and others, do a search on StackOverflow for the phrase ‘machine learning using python’.
The reason I’ve chosen the Natural Language Toolkit for this talk is that it has a particular focus on document analysis. As such, it contains some very useful tools for manipulating text that don’t come as standard in some of the other Python-based alternatives. Since it also contains a number of fairly standard machine learning classifiers, it is a great tool to get started with automated document classification.
  5. This diagram outlines the process for building an automatic classifier using supervised machine learning. The bottom line presents the end-goal, which is to be able to:
- Take a document that we do not have a classification for,
- Extract some features from that document, and
- Run those features through our trained classifier, so that
- We can get a predicted (and accurate) classification for the document.
In order to achieve this, the classifier has to be developed (trained) using a set of documents for which the classification is known. Ideally, the development set will contain a large number of documents that are representative of the kind of documents you’ll want to classify automatically. For each document in the development set you’ll need to extract a range of features to feed into the classifier during the training stage. Features could include things like the number of words in the document, whether or not a specific word (unigram) or group of words (ngram) appears in the document, who the sender of the document was, the percentage of the document written in ALLCAPS, and so on.
Different machine learning algorithms can be used, and some are more appropriate for document classification tasks than others. But we’ll talk about that some more later. At this stage, the key things to take away from the diagram are that the quality of your development set, your choice of features to extract, and the machine learning algorithm you select will all have an impact on the reliability of your trained classifier. A minimal end-to-end sketch of this pipeline follows below.
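To make that diagram concrete, here is a minimal end-to-end sketch using NLTK. The documents, labels and feature choices are invented for illustration; they are not the talk’s actual code:

from nltk.classify import NaiveBayesClassifier

def doc_features(text):
    # Toy extractor: bag-of-words tokens plus one length feature.
    features = {'is_short': len(text.split()) <= 20}
    for token in text.lower().split():
        features[token] = True
    return features

# Hypothetical development set: a list of (featureset, label) pairs.
dev_set = [(doc_features("No variances detected in final schedules"), 'IT'),
           (doc_features("Lunch on Friday to celebrate?"), 'Social')]

classifier = NaiveBayesClassifier.train(dev_set)
print(classifier.classify(doc_features("Parsing file for final schedules")))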
  6. The modules listed here come bundled with the NLTK. They can be used in the feature extraction and machine learning phases of the document classification process. These are just a sampling of what is available, so you should have a look through the NLTK documentation to identify other components that could fit your use-case.
Looking first at feature extraction:
- The corpus module contains ‘words’ and ‘stopwords’. ‘words’ is a list of English words, which can be useful if you want to check whether your documents contain a high proportion of common English words. You might do this if you are building a classifier to tell if the document is English or not. You could also use it to help detect if a document contained high levels of ‘jargon’ terms. ‘stopwords’ is a list of high-frequency words such as ‘the’, ‘to’ and ‘also’ that you often want to filter out of the set of words in a document that make it through to the feature list.
- The stem module contains tools for normalising the words in a document to their common root. For instance, the words ‘programmer’, ‘programs’, ‘programming’ and ‘program’ would all be stemmed to ‘program’ by the Lancaster stemmer. Stemmers are useful because they help your classification algorithm use the root of the word as a signal to pick up on, rather than using different variants of the word as different signals.
- The tokenize module provides tools to split your text down from a collection of words and symbols into discrete units. For instance, the WordPunctTokenizer breaks a text into words or sub-words based on the presence of punctuation and whitespace. Tokenizing is the common way to get a list of unigrams from your document that can then be included as features to be passed to your classifier for training. You can try out different stemmers, tokenizers and other NLTK tools on the StreamHacker blog by Jacob Perkins (see http://streamhacker.com or http://text-processing.com/demo/).
- NLTK also contains tools for finding and selecting collocations, or words that commonly appear together. For instance, the Bigram tools help identify and select two-word combinations that appear throughout a text. There is also a set related to Trigrams (three-word combinations). The presence (or not) of these word combinations can also be included as features to be passed in to your classifier for training.
Turning to machine learning algorithms, the Natural Language Toolkit contains a handful of commonly used classifiers including a Naïve Bayesian classifier, a Decision Tree classifier and a Multinomial Logit classifier (called Maxent). It also provides an interface to the Java-based ‘Weka’ collection of open source machine learning algorithms from the University of Waikato. It is beyond the scope of this talk, and my knowledge, to go into the details of each classifier. However, each classifier has its own profile of computational requirements, flexibility and interpretability. So, it is worth investigating the most appropriate classifier for your purpose and also testing different classifiers to see which produce the best outcomes for your datasets.
In addition to the classification algorithms themselves, NLTK provides some tools for assessing the performance of your trained classifier under the ‘classify.util’ module. The ‘accuracy’ test is just one of these, and we’ll look at some alternative ways to assess your classifier when we run through the example. A short sketch of the feature-extraction tools in action follows below.
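As an illustration, here is a hedged sketch of those feature-extraction tools in action (the sample sentence is invented; on a single sentence the bigram scores are not meaningful, but the calls are the same on a full corpus):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

stopset = set(stopwords.words('english'))
stemmer = PorterStemmer()
tokenizer = WordPunctTokenizer()

text = "FERC is starting to grapple with regulating financial trades"
tokens = tokenizer.tokenize(text.lower())
stems = [stemmer.stem(t) for t in tokens if t not in stopset and len(t) > 1]
# e.g. 'regulating' -> 'regul', 'trades' -> 'trade'

# Two-word collocations ranked by chi-squared association.
finder = BigramCollocationFinder.from_words(tokens)
top_bigrams = finder.nbest(BigramAssocMeasures.chi_sq, 5)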
  7. I used a Naïve Bayesian classifier for today’s example. In essence, this classifier generates an estimate of the likelihood that a document is of the class of interest (or label, in the above formulae) by looking at the features that it has and computing the likelihood that a document with those features would be of the class of interest. The likelihood for each feature and label is determined during the training phase, by iterating through the development set of documents for which both the features and label are known. Thus, the classifier extrapolates the probabilities it establishes during training to new documents it encounters. The classifier is Bayesian because it employs Bayes’ theorem to build up conditional probability distributions for the labels and features. It is considered naïve because it assumes that the occurrence of one feature is unrelated to the occurrence of another feature. That is, it assumes the features are independent. This assumption is frequently incorrect. Nevertheless, Bayesian classifiers have proven to perform well despite this apparent flaw in logic! Those interested in learning more about the mathematics underlying the classifier should head over to the related Wikipedia entries for ‘naïve Bayesian classifier’ and ‘Bayesian spam filtering’, which contain some worked examples.
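As a toy numeric illustration of the formula (every probability below is invented for the example):

# Two features, one label; the probabilities would normally be
# estimated from the training set.
p_it = 0.17                # P(label = IT)
p_txt_given_it = 0.30      # P('txt' present | IT)
p_txt_given_not = 0.002    # P('txt' present | not IT)
p_short_given_it = 0.60    # P(message_length = 'Very Short' | IT)
p_short_given_not = 0.20

# Unnormalised scores: P(label) * product of P(f_i | label).
score_it = p_it * p_txt_given_it * p_short_given_it
score_not = (1 - p_it) * p_txt_given_not * p_short_given_not

# Dividing by P(features) = sum of the scores gives the posterior.
p_it_given_features = score_it / (score_it + score_not)  # ~0.99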
  8. Now that we have covered the process and components necessary to build a document classifier using the NLTK, I’d like to jump into an example. For those of you who haven’t heard about Enron, it was a very large US energy and commodities trading company which went spectacularly bankrupt in late 2001. The ripples of its bankruptcy were felt throughout the American financial and political system. Subsequent investigations surrounding its accounting practices led to the arrest and imprisonment of a number of its key executives, along with the downfall of one of the world’s largest accounting firms, Arthur Andersen.
As part of the investigations that went on, Enron was required to provide prosecutors with the organisation’s emails for a period prior to the collapse. Those emails subsequently made it into the public domain, and I thought they’d be an interesting set of documents to build some classifiers on for this talk.
  9. In total, just over half a million emails were released as part of the Enron document discovery process. That’s a lot of documents to try and look through if you are a prosecutor attempting to identify key documents for your case. It is likely to be prohibitively expensive to hire a group of people (even interns!) to look through all of the documents. It would probably also be very difficult to build a classifier to try and pick out ‘key documents’ since, by definition, these are rare. If you think about these key documents as needles in a haystack, you would spend an inordinate amount of time trying to find examples to feed into your development set. By the time you’d found them, you would have probably looked through half of the emails already!
Of course, if you had some keywords you knew would exist in the documents, or you knew who the key players were in the case, you could search for key terms or only read the emails from certain individuals. But if these avenues were exhausted, another approach you could take is to try and reduce the size of the haystack.
Specifically, if you could identify the types of documents that are very unlikely to contain information you are interested in, and then have a classifier run over the entire set to identify documents that are likely to be of those types, you could reduce your workload significantly. This would allow you to focus your efforts on documents that weren’t clearly ‘hay’.
  10. Although the full set of 517 thousand messages was available as an archive, it had not had much in the way of elementary data cleaning done to it. So, instead I have used a subset of just over 20 thousand messages that have already gone through some cleaning. There is a link to the download at the end of the talk for those interested. One point to note is that the subset comes from seven Enron employees who had a particularly large number of emails in their folders. So, the subset is not likely to be representative of the entire message set and, as a result, any classifiers built on them could not be expected to perform well over the entire set.
The messages came as separate plaintext files within a folder structure reflecting the owner and the email subfolder the message was assigned to by the owner. Although I have not used these elements of structure as features in developing the classifiers here, this sort of information would be relevant to the classification task and might be included in a more comprehensive modelling project.
I used the MySQLdb interface for Python, along with functions from the standard os and string modules, to import each plaintext file, separate out some key fields such as the sender, recipients, subject line and message body, and finally insert a row into a dedicated MySQL table. A sketch of this step follows below.
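Here is a hedged sketch of that extract-and-load step. The folder name, table name, column names and header parsing are simplified assumptions; the actual code is at benhealey.info:

import MySQLdb, os, random

conn = MySQLdb.connect(host='localhost', user='enron', passwd='...', db='enron')
cur = conn.cursor()

for root, dirs, files in os.walk('maildir'):  # assumed folder of plaintext messages
    for name in files:
        raw = open(os.path.join(root, name)).read()
        headers, _, body = raw.partition('\n\n')  # headers end at the first blank line
        fields = {'msg_to': '', 'msg_from': '', 'msg_cc': '', 'msg_subject': ''}
        for line in headers.splitlines():
            for prefix, col in [('To: ', 'msg_to'), ('From: ', 'msg_from'),
                                ('Cc: ', 'msg_cc'), ('Subject: ', 'msg_subject')]:
                if line.startswith(prefix):
                    fields[col] = line[len(prefix):]
        cur.execute("INSERT INTO messages (msg_to, msg_from, msg_cc, msg_subject, "
                    "msg_body, rand_num) VALUES (%s, %s, %s, %s, %s, %s)",
                    (fields['msg_to'], fields['msg_from'], fields['msg_cc'],
                     fields['msg_subject'], body, random.random()))

conn.commit()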
  11. As part of the extraction process I also allocated a random number to each message and created some derived fields such as the number of people the message was sent to, the number of words in the body, and the percentage of words that were in all-caps. The random number was used to select a sub-sample from the full table to use as a development set, while the derived fields were created as features to be fed into the Bayesian classifier.
As an aside, I’ve found that it is generally a good idea to allocate a random number to each record when you are creating a dataset. Most databases, be they relational or NoSQL, don’t handle random selection tasks very well, so it can save time to have a way of randomly ordering records pre-baked into your data. Those interested in the details can download the code at benhealey.info. Just do a search for KiwiPycon to find the files.
It is worth noting that more cleaning could have been done to the messages. For instance, a number of the messages are forwards or replies, so they contain two or more messages in one, along with the original message metadata. This sort of duplication and noise might confuse a learning algorithm. Alternatively, it may also help the algorithm pick up on key signals about the message context. Whatever the case, an attempt would ideally be made to train the classifier with and without the presence of these confounds to determine the effect on model accuracy. The derived fields mentioned above need only plain string handling, as sketched below.
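A sketch of those derived fields (the function and field names are assumptions, not the talk’s code):

def derived_fields(body, to_line, cc_line):
    # Simple counts and percentages used later as classifier features.
    words = body.split()
    num_words = len(words)
    num_to = len(to_line.split(',')) if to_line else 0
    num_cc = len(cc_line.split(',')) if cc_line else 0
    pct_digits = 100.0 * sum(c.isdigit() for c in body) / max(len(body), 1)
    pct_caps = 100.0 * sum(1 for w in words if w.isupper()) / max(num_words, 1)
    return num_to, num_cc, num_words, pct_digits, pct_caps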
  12. Here is one of the messages from the set. I classified this as a legal/regulatory message, but it is possible that someone else might classify it as an administrative message since it pertains to a meeting agenda. The key point is that any ambiguity inherent in the message context will likely flow through into ambiguity in the training phase and degrade the performance of the classifier.
  13. This message is a little more clear-cut. I’ve classified it as an IT message, since it pertains to the output from an automated process. At a glance we can see that messages such as this are likely to contain signals that should be easily picked up by a classifier. For instance, the message is short, contains a number of all-caps words, and is likely to contain symbols such as ‘-->>’ that would not normally appear in ‘human’ messages. Look at the message for a few seconds more and you will probably come up with a handful of other signals that might help a classifier distinguish this sort of message from another. This is a key way to come up with features to feed into your classifier. So, having a human parse some of the documents is a key part of the idea generation and model training phase.
  14. Another aspect that may require human involvement is in assigning a class to the development set. For this example, I went through and tagged a random sample of 1000 messages with one or more classes. Don’t try this at home. It is quite boring. The more sensible approach, if feasible, would be to outsource the tagging to a crowdsourcing service such as Amazon Mechanical Turk or CrowdFlower. I did actually attempt to put my sample through CrowdFlower, but came up against a number of formatting issues with the messages. These, and the looming conference deadline, led me to decide to tag them myself using a hastily created MS Access form instead!
If you are particularly lucky you may have a development set that is already tagged, or that is trivial to tag automatically. For instance, if you were building a classifier to predict which stocks would rise in value in future, you would be able to tag an historic training set very easily using historic stock price information.
A couple of points to note from this chart:
- My intent here is to build one classifier per class. For instance, I’ll build one classifier that predicts whether a given message is related to External Relations, or not. Another classifier will predict whether a given message is related to Info Tech or not, and so on. Just be aware that there are classifiers that will allow you to predict across multiple classes. Indeed, you could even combine the results from the separate models in this example to get one ‘overall predicted class’ for each message. However, this is beyond the scope of this talk and my knowledge at this point!
- You’ll notice that the ‘External Relations’ and ‘Social/Personal’ classes don’t have many messages in them. This is likely to hinder development of an accurate classifier for those classes, since the classifier will not have many examples to train on. In contrast, there is a better chance we will be able to build an accurate classifier for the IT, Regulatory, and Deals classes.
  15. With the development set of 1000 randomly selected messages in place, the next step is to pull out a range of features for the classifier to train on. I’ve taken a kitchen-sink, ‘one size fits all’ approach for the models in this talk, but with more time I’d select features for each model in an iterative process. For instance, you’d throw a bunch of features into the first model (say, for IT), check out its accuracy, then whittle down or modify the features and see what effect that has on accuracy. What you are aiming for is a model that is both robust (i.e., it gives good predictions against a test set of messages) and parsimonious (i.e., it uses the least number of features necessary to achieve its accuracy). Simply throwing the kitchen sink at your model, as I have done here, increases the chance of overfitting, which is where the model performs well on the training set but poorly when you attempt to apply it to documents that it was not trained on.
This slide lists the core NLTK modules I used to expand out the feature set for my models. You can download the code relating to this at benhealey.info.
  16. One point worth noting about the feature development stage is that you’ll generally be aiming to generate features from your documents that are either Boolean or nominal (aka categorical). By nominal I mean a short list of possible categories, not necessarily with any inherent order. The example on this slide shows a set of categories for the ‘num_words_in_body’ feature which do have an inherent order (‘long’ messages are larger than ‘medium’ messages), but a Bayesian classifier is ignorant of this. As far as it is concerned, you might as well be giving it a list of colors. All it cares about is that you are giving it a feature which casts a message into a specific, mutually exclusive group based on this thing you’ve called ‘num_words_in_body’.
  17. A Boolean feature is formatted as its name suggests. The token-based features (ngrams, bigrams etc.) extracted using some of the NLTK tools are generally presented in Boolean format to classifiers for training: either the feature exists for the document, or it doesn’t.
Note that you could also feed numeric features into the classifier. However, you need to take care to avoid a situation where the feature can take on so many different values that there are not enough instances of each value in the development set for the classifier to build up a good history of that instance’s co-occurrence with each document class. This is why I created the ‘num_words_in_body’ feature as a series of bins rather than leaving it as a discrete numeric variable. A small binning helper is sketched below.
One final thing to note is that all of the features I’ve extracted are ones that I will be able to extract for any new, unclassified documents I want to feed into the classifier. If they weren’t, I couldn’t expect any resulting trained classifier to accurately predict the class for new documents.
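A minimal binning helper in the spirit of the earlier slide (the cut-points are the ones shown there):

def bin_word_count(n):
    # Map a raw count into a few nominal values so the classifier
    # sees enough training cases for each value.
    if n <= 20:
        return 'Very Short'
    elif n <= 80:
        return 'Short'
    elif n <= 300:
        return 'Medium'
    return 'Long'

features['message_length'] = bin_word_count(record['num_words_in_body'])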
  18. Much of the grunt work has now been done. The data is in a reasonable state for modelling, a range of features has been extracted, and the development set has at least one class assigned to each document. The next step is to train a classifier using the data and assess its accuracy. NLTK makes this relatively simple. As the slide shows, the development set is split into two chunks at random. The first chunk, the training set, is used to train the classifier. The second chunk, the test set, is used to test the model’s accuracy. This second chunk is sometimes referred to as a hold-out sample, and acts as a tool for determining how the classifier might perform ‘in practice’ if we asked it to predict the classes for documents it had not seen during training.

You can test a model more rigorously than shown in this slide, by performing multiple rounds of cross-validation, but this is enough to provide a good overview of the process.

You’ll see that I have used two of NLTK’s tools for assessing accuracy: the accuracy function and the classifier’s own ‘show_most_informative_features’ method. Again, I’ve used these for illustrative purposes; there are other metrics that could also be applied to assess the model.

Again, you can get my code for this from benhealey.info
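In outline, the split-train-test sequence looks something like the sketch below. It assumes ‘dev_set’ is the list of (featureset, class) pairs built earlier; the exact split proportion is illustrative:

    import random
    from nltk.classify import NaiveBayesClassifier
    from nltk.classify.util import accuracy

    random.shuffle(dev_set)
    cutoff = len(dev_set) // 3                    # roughly a third for training
    train_set, test_set = dev_set[:cutoff], dev_set[cutoff:]

    classifier = NaiveBayesClassifier.train(train_set)
    print('Accuracy: %.2f' % accuracy(classifier, test_set))
    classifier.show_most_informative_features(10)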
  19. Here are those two performance metrics for the ‘IT’ classifier. The accuracy figure of 0.93 indicates that the classifier allocated the correct class (IT or not IT) to 93% of the test cases. This is a promising sign, and not unexpected given the general structure of these types of messages.

The ‘most informative features’ list outlines the features that are most useful to the model in distinguishing ‘IT’ messages from other messages, along with the likelihood ratios for those features. It shows, for instance, that messages containing the word (or token) ‘txt’ are IT-related about 150 times more often than they are non-IT-related. The features listed make sense given the IT context, so we can gain some confidence that the model is working OK.
  20. This slide shows the same metrics for a classifier built to distinguish ‘deal or trading’ messages from other messages. Again, the results are generally promising. Accuracy is not as good as for the IT model, but it is still reasonable, and the most informative features make sense: ‘nom’ and ‘nomin’ are common trading terms, while ‘mmbtu’ refers to gas volumes (millions of British thermal units) and ‘txu’ to the Texas energy company TXU.

The likelihood ratios are not as high as some of those identified for the IT model, so this model may have more difficulty distinguishing between ‘deal’ and ‘non-deal’ message types.
  21. And here are the metrics for a poorly performing model: ‘social/personal’.

The accuracy figure of 0.48 suggests this classifier is having quite a bit of difficulty distinguishing ‘social’ messages from others. If you cast your mind back to the chart I presented earlier, you’ll remember that very few messages were tagged as ‘social’ in the training set, so the classifier didn’t have much to go on. Given that, this result is not surprising. More work would need to be done to gather more training cases for this class, or to tease out more features that might give the classifier a better chance of predicting correctly.

One positive is that the informative features do seem to make sense given the context. Most of the terms are those you would expect to see more often in personal or social messages, and less often in other types of messages.
  22. Although the accuracy metric presented earlier can be a good indicator of classifier performance, it is dependent on the rate of incidence of a class in the test set. For example, imagine a classifier for the class “hen’s tooth”, where only 1 out of 100 test cases is actually a hen’s tooth. If the “hen’s tooth” classifier was really bad, so that all it did was classify each document as “not a hen’s tooth”, it would still classify 99 of the 100 test cases correctly. That is, even a very bad model can theoretically achieve an accuracy score of 99%.

It is therefore wise to consider a variety of performance metrics when assessing a classifier. This slide presents two alternative ways of looking at model performance that I find useful. They rely on the fact that the NLTK classifiers produce, for a given document, both a predicted class and an associated probability score. You can take the probability scores for all documents in your test set, order them from lowest to highest, and assign each document to a decile (9 being the 10% of documents with the highest probability of being your class of interest). The table and chart above are based on these deciles, and show what the actual classes were for the documents in each decile. Note that I’ve actually produced these using my full development set of 1,000 cases, rather than just the training set of 333. You shouldn’t really do this, but I wanted to use all of my cases to give a little stability to the figures produced. So, bear in mind that these are ‘cheat’ scores for the models.

The percentages in the table should be read along each row, and give an indication of where the model is getting confused. Percentages less than 10% have been suppressed. For the IT model there is very little confusion: 95% of the cases in the top decile were actually IT messages, and 50% of the cases in decile 8 were IT messages. For all other deciles, the percentage of cases that were IT was less than 10%.

The chart presents information only for ‘hits’; that is, those cases that were classed as ‘IT’ (167 out of 1,000). It shows that around 55% of those cases appeared in the top decile, and around 85% in the top two deciles. As a comparison, if the classifier were simply allocating probabilities at random, you’d expect the ‘hits’ to be spread evenly across the deciles: around 10% of cases in the top decile, 20% in the top two deciles, and so on. This is what the dashed blue line represents.

For an ideal classifier you would see all of the ‘hits’ appearing in the top deciles, with the cumulative line reaching 100% very quickly. Overall, then, these results indicate the classifier performs very well at classifying IT messages.
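A decile table like the one described can be produced from the classifiers’ probability output. A minimal sketch, assuming ‘labelled_features’ is a list of (featureset, actual class) pairs and that the target class is labelled against ‘Other’ as before:

    from collections import Counter

    def decile_table(classifier, labelled_features, target):
        # Score every document, sort by predicted probability of the target
        # class, cut into deciles (9 = highest), and tally actual classes.
        scored = [(classifier.prob_classify(feats).prob(target), label)
                  for feats, label in labelled_features]
        scored.sort()
        n = len(scored)
        deciles = dict((d, Counter()) for d in range(10))
        for i, (prob, label) in enumerate(scored):
            deciles[min(i * 10 // n, 9)][label] += 1
        return deciles

Reading each decile’s tallies along its row then gives the confusion picture described above.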
  23. Looking at the charts for the ‘deal’ classifier, it appears that this is also performing well, although perhaps not as well as the IT classifier. The table suggests that the classifier is having some difficulty distinguishing ‘deal’ messages from ‘other’ and ‘legal’ messages, so some more work could be done to tease out features that might help it there.
  24. Finally, the ‘social’ classifier is performing about as badly as we might have expected from the early indications!

The table suggests that the model is having difficulty distinguishing ‘social’ messages from most other message types. This is reinforced by the chart, which shows the model performing only slightly better than we might expect by chance.

At this point, if we cast our focus back to the prosecutor in the Enron case, we can start to see a way forward with automated document classification. Even after this ‘beta’ run-through, the prosecutor could be confident that a classifier would help them eliminate IT messages (estimated to be around 17% of all messages in the sample) from the haystack. Similarly, if deal messages were likely to contain information of interest to the case, there is now a classifier that can be used to identify the messages in the haystack that should be prioritised for human review. And, as those messages were reviewed, more information would become available that could be fed into further development of the classifiers.
  25. So there you have it. We’ve covered the process for developing automated document classifiers using NLTK and Python, with the Enron emails as an illustrative example. Hopefully you’ll be able to see how this sort of modelling could be useful to you professionally or personally.

I’ve mentioned a few ‘gotchas’ along the way, but it is worth reiterating them before I finish up, since they can have a big impact on the success of your classifiers.

Try to avoid using a biased sample to train your classifiers. If your development set doesn’t contain the same general mix of documents as those you expect to apply your classifier to, you can’t really expect it to perform well in practice.

Be careful with the ‘accuracy’ metric, particularly if you are building a classifier for a rare document type. There are methods for dealing with rare events which I’ve not covered here, and it pays to look at a few different performance metrics before you finalise and rely on a model.

Your prior knowledge, or that of subject matter experts, can be very useful for identifying key features to give your classifier for training. Ultimately, you are trying to get the classifier to pick up on knowledge that you probably already have as a human, so that you can apply that knowledge at scale. Think about what enables you, as a human, to distinguish one document type from another, and formalise that as much as possible so it can be used in an algorithm.

Finally, don’t expect the first classifier you build to be the best one. Like most things in life, there is a learning cycle involved. You’ll probably go through a number of ‘draft’ classifiers before you settle on a final one that is reliable and parsimonious. Even then, once the classifier is in production it will need to be tweaked as new knowledge about the domain becomes available.
  26. Those interested in trying their hand at automated classification will find the following resources useful. The first link is to the NLTK website, where you’ll also find a link to a book covering the toolkit, natural language processing, and document classification.

The second link is to a downloadable archive of the Enron emails.

Those interested in learning more about machine learning should also consider the free online course being offered through Stanford from October to November 2011. You can sign up at ml-class.com

Finally, Jacob Perkins runs a blog about Python, natural language processing, and machine learning which contains a number of very useful examples, demos, and articles.