Corpus Bootstrapping with NLTK
by Jacob Perkins
Jacob Perkins


• http://www.weotta.com
• http://streamhacker.com
• http://text-processing.com
• https://github.com/japerk/nltk-trainer
• @japerk
Problem



• you want to do NLProc
• many proven supervised training algorithms
• but you don't have a training corpus
Solution




• make a custom training corpus
Problems with Manual Annotation



• takes time
• requires expertise
• expert time costs $$$
Solution: Bootstrap


• less time
• less expertise
• costs less
• requires thinking & creativity
Corpus Bootstrapping at Weotta



• review sentiment
• keyword classification
• phrase extraction & classification
Bootstrapping Examples



• English -> Spanish sentiment
• phrase extraction
Translating Sentiment



• start with an English sentiment corpus & classifier
• English -> Spanish -> Spanish
English -> Spanish -> Spanish

1. translate English examples to Spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
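
The six steps above amount to a train, classify, correct, retrain loop. A minimal sketch of that loop follows. It assumes two hypothetical callables that are not part of nltk-trainer: `train`, which builds a classifier from labeled examples, and `review`, which stands in for the manual correction step.

```python
def bootstrap(seed, batches, train, review):
    """Grow a labeled corpus from a seed corpus and batches of unlabeled text.

    seed    -- list of (text, label) pairs, e.g. the translated examples
    batches -- iterable of lists of unlabeled texts
    train   -- hypothetical callable: labeled pairs -> object with .classify(text)
    review  -- hypothetical callable standing in for manual correction:
               list of (text, predicted_label) -> corrected list of (text, label)
    """
    corpus = list(seed)
    classifier = train(corpus)                # step 2: initial classifier
    for batch in batches:
        # step 3: label new text with the current classifier
        predicted = [(text, classifier.classify(text)) for text in batch]
        corpus.extend(review(predicted))      # steps 4 and 6: correct, then add
        classifier = train(corpus)            # step 5: retrain on the larger corpus
    return classifier, corpus
```

Each pass through the loop retrains on everything accepted so far, so corrections made in one round can improve the predictions in the next.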
Translate Corpus


$ translate_corpus.py movie_reviews --source english --target spanish
Train Initial Classifier



$ train_classifier.py spanish_movie_reviews
Create New Corpus


$ classify_to_corpus.py spanish_sentiment \
    --input spanish_examples.txt \
    --classifier spanish_movie_reviews_NaiveBayes.pickle
Manual Correction



1. scan each file
2. move incorrect examples to correct file
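
Moving a misclassified example can be scripted once you have spotted it. The sketch below assumes a corpus laid out as one directory per category with one example per file. That layout is an assumption for illustration and may not match what classify_to_corpus.py writes. The helper name `recategorize` is hypothetical.

```python
import shutil
from pathlib import Path

def recategorize(corpus_root, filename, wrong, right):
    """Move one example file from the `wrong` category directory to `right`."""
    src = Path(corpus_root) / wrong / filename
    dst_dir = Path(corpus_root) / right
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / filename
    if dst.exists():
        # refuse to overwrite an example that was already reviewed
        raise FileExistsError(dst)
    shutil.move(str(src), str(dst))
    return dst
```

For example, `recategorize('spanish_sentiment', 'review_12.txt', 'pos', 'neg')` would move a review wrongly filed as positive into the negative category (the file name is made up).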
Train New Classifier



$ train_classifier.py spanish_sentiment
Adding to the Corpus

• start by adding only examples classified with >90% probability
• retrain
• carefully decrease the probability threshold
Add more at a Lower Threshold


$ classify_to_corpus.py categorized_corpus \
    --classifier categorized_corpus_NaiveBayes.pickle \
    --threshold 0.8 --input new_examples.txt
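
The idea behind the threshold can be sketched in a few lines: keep a prediction only when the classifier's probability for its best label is high enough. This is an illustration of the idea, not the script's actual implementation. It relies on the `prob_classify()` method of NLTK classifiers, which returns a probability distribution with `max()` and `prob()`. The function name `select_confident` is hypothetical.

```python
def select_confident(classifier, featuresets, threshold=0.9):
    """Return (featureset, label, probability) for confident predictions only."""
    selected = []
    for feats in featuresets:
        dist = classifier.prob_classify(feats)  # probability distribution over labels
        label = dist.max()                      # most likely label
        prob = dist.prob(label)                 # its probability
        if prob >= threshold:
            selected.append((feats, label, prob))
    return selected
```

Lowering `threshold` from 0.9 to 0.8 admits more examples per round, at the cost of more errors to correct by hand.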
When are you done?



• what level of accuracy do you need?
• does your corpus reflect real text?
• how much time do you have?
Tips


• garbage in, garbage out
• correct bad data
• clean & scrub text
• experiment with train_classifier.py options
• create custom features
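
As one illustration of a custom feature, the sketch below builds a bag-of-words featureset and optionally adds adjacent word pairs, so that "not good" is seen as a unit. NLTK classifiers accept featuresets as dicts mapping feature names to values. The function name and its parameters are hypothetical, not something nltk-trainer provides in this form.

```python
def bag_of_words(words, stopwords=frozenset(), bigrams=True):
    """Build a dict featureset from a list of word tokens."""
    lowered = [w.lower() for w in words]
    # one boolean feature per word that is not a stopword
    feats = {w: True for w in lowered if w not in stopwords}
    if bigrams:
        # one boolean feature per pair of adjacent words
        feats.update({(a, b): True for a, b in zip(lowered, lowered[1:])})
    return feats
```

Whether bigrams help depends on the corpus, so compare accuracy with and without them before keeping the feature.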
Bootstrapping a Phrase Extractor
1. find a POS-tagged corpus
2. annotate raw text
3. train POS tagger
4. create POS-tagged & chunked corpus
5. tag unknown words
6. train POS tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done
NLTK Tagged Corpora

• English: brown, conll2000, treebank
• Portuguese: mac_morpho, floresta
• Spanish: cess_esp, conll2002
• Catalan: cess_cat
• Dutch: alpino, conll2002
• Indian Languages: indian
• Chinese: sinica_treebank
• see http://text-processing.com/demo/tag/
Train Tagger



$ train_tagger.py treebank --simplify_tags
Phrase Annotation


Hello world, [this is an important phrase].
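
Annotations in this style are easy to check before tagging. A minimal sketch that pulls the bracketed phrases out of an annotated line, assuming square brackets never nest (the helper name is hypothetical):

```python
import re

# text between a matching pair of square brackets, with no nesting
PHRASE_RE = re.compile(r'\[([^\[\]]+)\]')

def annotated_phrases(line):
    """Return the bracketed phrases found in one annotated line."""
    return [match.strip() for match in PHRASE_RE.findall(line)]

print(annotated_phrases('Hello world, [this is an important phrase].'))
# ['this is an important phrase']
```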
Tag Phrases


$ tag_phrases.py my_corpus \
    --tagger treebank_simplify_tags.pickle --input my_phrases.txt
Chunked & Tagged Phrase


Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
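
To make the format concrete, the sketch below reads one such line into (word, tag, IOB) triples using only the standard library. It assumes whitespace-separated `word/tag` tokens with standalone `[` and `]` markers, as in the example above. The chunk label `PHRASE` is a placeholder, since the bracket notation itself carries no label.

```python
def chunked_to_iob(line, label='PHRASE'):
    """Convert 'w/T [ w/T w/T ] w/T' into (word, tag, iob) triples."""
    triples = []
    inside = False  # currently between [ and ]
    first = False   # the next token starts a new chunk
    for token in line.split():
        if token == '[':
            inside, first = True, True
        elif token == ']':
            inside = False
        else:
            # split on the last slash so words containing '/' survive
            word, tag = token.rsplit('/', 1)
            if not inside:
                iob = 'O'
            elif first:
                iob, first = 'B-' + label, False
            else:
                iob = 'I-' + label
            triples.append((word, tag, iob))
    return triples
```

A view like this makes it easy to spot a bracket that opens or closes one word too early during correction.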
Correct Unknown Words



1. find -NONE- tagged words
2. fix tags
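
Finding the unknown words can be automated even though fixing them cannot. A minimal sketch that counts words whose tag is `-NONE-` (or Python `None`, which NLTK taggers return when they cannot tag a word), given sentences as lists of (word, tag) pairs. The function name is hypothetical.

```python
from collections import Counter

def unknown_words(tagged_sents, unknown_tags=(None, '-NONE-')):
    """Count the words whose tag marks them as unknown."""
    counts = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            if tag in unknown_tags:
                counts[word] += 1
    return counts
```

Calling `unknown_words(sents).most_common(20)` lists the most frequent unknown words first, which is where a corrected tag pays off the most.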
Train New Tagger


$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Train Chunker


$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Extracting Phrases
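import collections
import nltk.data
from nltk import tokenize
from nltk.tag import untag

# pickled models, assumed to come from the train_tagger.py and train_chunker.py steps
tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # map each chunk label to the list of phrases found under it
    d = collections.defaultdict(list)
    # NLTK 3 uses Tree.label(); older releases exposed this as .node
    for sub in t.subtrees(lambda s: s.label() != 'S'):
        d[sub.label()].append(' '.join(untag(sub.leaves())))
    return d

text = 'Hello world, this is an important phrase.'  # placeholder input text
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# e.g. defaultdict(<class 'list'>, {'PHRASE_TAG': ['phrase']})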
Final Tips


• error correction is faster than manual annotation
• find close enough corpora
• use nltk-trainer to experiment
• iterate -> quality
• no substitute for human judgement
Links



http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com
