Corpus Bootstrapping with NLTK: Less Time, Lower Costs

Jacob Perkins

Presented at Strata 2012 Deep Data session.

Corpus Bootstrapping with NLTK
by Jacob Perkins

Jacob Perkins

http://www.weotta.com
http://streamhacker.com
http://text-processing.com
https://github.com/japerk/nltk-trainer
@japerk

Problem

you want to do NLProc
many proven supervised training algorithms
but you don’t have a training corpus

Solution

make a custom training corpus

Problems with Manual Annotation

takes time
requires expertise
expert time costs $$$

Solution: Bootstrap

less time
less expertise
costs less
requires thinking & creativity

Corpus Bootstrapping at Weotta

review sentiment
keyword classification
phrase extraction & classification

Bootstrapping Examples

english -> spanish sentiment
phrase extraction

Translating Sentiment

start with english sentiment corpus & classifier
english -> spanish -> spanish

English -> Spanish -> Spanish

1. translate english examples to spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
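
The six steps above can be sketched as a loop. This is a minimal illustration under stated assumptions, not nltk-trainer's code: `train`, `bootstrap`, and `review` are hypothetical names, the toy trainer only counts words per label, and `review` stands in for the human who corrects each batch.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Toy trainer (an assumption for illustration): count words per label."""
    counts = defaultdict(Counter)
    for text, label in corpus:
        counts[label].update(text.lower().split())

    def classify(text):
        # Pick the label whose training words overlap most with the text.
        words = text.lower().split()
        return max(sorted(counts), key=lambda lab: sum(counts[lab][w] for w in words))

    return classify

def bootstrap(seed_corpus, batches, review):
    """Steps 2-6: train, classify new text, correct it, add it, retrain."""
    corpus = list(seed_corpus)
    classify = train(corpus)                                    # step 2
    for batch in batches:
        guessed = [(text, classify(text)) for text in batch]    # step 3
        corrected = [review(text, label) for text, label in guessed]  # step 4
        corpus.extend(corrected)                                # step 6
        classify = train(corpus)                                # step 5
    return corpus, classify

# Hypothetical data: a translated seed corpus and two batches of new text.
seed = [("muy buena pelicula", "pos"), ("muy mala pelicula", "neg")]
gold = {"una buena historia": "pos", "una mala historia": "neg",
        "historia terrible": "neg"}
review = lambda text, label: (text, gold[text])  # stands in for a human reviewer
batches = [["una buena historia", "una mala historia"], ["historia terrible"]]
corpus, classify = bootstrap(seed, batches, review)
```

The point of the loop is that the human only corrects labels the classifier already guessed, rather than annotating from scratch.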

Translate Corpus

$ translate_corpus.py movie_reviews --source english \
    --target spanish

Train Initial Classifier

$ train_classifier.py spanish_movie_reviews

Create New Corpus

$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt \
    --classifier spanish_movie_reviews_NaiveBayes.pickle
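
To picture what "classifying into a new corpus" might produce, here is a rough sketch that writes each example to one file per predicted label. The layout (one `<label>.txt` file per category) and the names `write_corpus` and `toy_classify` are assumptions for illustration; this does not describe how classify_to_corpus.py itself works.

```python
import os
import tempfile
from collections import defaultdict

def write_corpus(examples, classify, corpus_dir):
    """Append each example to <corpus_dir>/<label>.txt, one example per line."""
    by_label = defaultdict(list)
    for text in examples:
        by_label[classify(text)].append(text)
    os.makedirs(corpus_dir, exist_ok=True)
    paths = {}
    for label, texts in by_label.items():
        paths[label] = os.path.join(corpus_dir, label + ".txt")
        with open(paths[label], "a", encoding="utf-8") as f:
            f.writelines(t + "\n" for t in texts)
    return paths

# Hypothetical stand-in for a trained classifier.
def toy_classify(text):
    return "pos" if "buena" in text else "neg"

corpus_dir = os.path.join(tempfile.mkdtemp(), "spanish_sentiment")
paths = write_corpus(["una buena historia", "una mala historia"],
                     toy_classify, corpus_dir)
```

With one file per label, manual correction becomes a matter of reading a file and moving misfiled lines to the other one.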

Manual Correction

1. scan each file
2. move incorrect examples to the correct file

Train New Classifier

$ train_classifier.py spanish_sentiment

Adding to the Corpus

start with >90% probability
retrain
carefully decrease probability threshold
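
The thresholding above can be sketched as a filter over classifier probabilities. NLTK classifiers expose `prob_classify()`, which returns a distribution with `max()` and `prob()` methods; the stub classes below are hypothetical stand-ins that mimic only those methods so the example runs without a trained model. With a real NLTK classifier you would pass a feature dict rather than raw text.

```python
class StubDist:
    """Hypothetical stand-in mimicking max()/prob() of an NLTK ProbDist."""
    def __init__(self, probs):
        self._probs = probs

    def max(self):
        return max(self._probs, key=self._probs.get)

    def prob(self, label):
        return self._probs.get(label, 0.0)

class StubClassifier:
    """Hypothetical classifier; looks up canned probabilities by text."""
    def __init__(self, table):
        self._table = table

    def prob_classify(self, text):
        return StubDist(self._table[text])

def confident_examples(classifier, texts, threshold=0.9):
    """Keep (text, label) pairs whose top label probability exceeds threshold."""
    kept = []
    for text in texts:
        dist = classifier.prob_classify(text)
        label = dist.max()
        if dist.prob(label) > threshold:
            kept.append((text, label))
    return kept

# Made-up probabilities, purely for illustration.
table = {
    "excelente": {"pos": 0.97, "neg": 0.03},
    "no esta mal": {"pos": 0.85, "neg": 0.15},
    "regular": {"pos": 0.55, "neg": 0.45},
}
clf = StubClassifier(table)
first_pass = confident_examples(clf, table, threshold=0.9)
second_pass = confident_examples(clf, table, threshold=0.8)
```

Lowering the threshold admits more examples per round, at the cost of more corrections to make by hand.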

Add more at a Lower Threshold

$ classify_to_corpus.py categorized_corpus \
    --classifier categorized_corpus_NaiveBayes.pickle \
    --threshold 0.8 --input new_examples.txt

When are you done?

what level of accuracy do you need?
does your corpus reflect real text?
how much time do you have?
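
One way to put a number on the first question is to hold back a small hand-checked sample and measure accuracy on it after each retraining round. A minimal sketch: `accuracy` and `toy_classify` here are hypothetical, and NLTK ships a comparable helper, `nltk.classify.accuracy`, for its own classifiers.

```python
def accuracy(classify, gold):
    """Fraction of hand-checked (text, label) pairs the classifier gets right."""
    if not gold:
        raise ValueError("gold sample is empty")
    correct = sum(1 for text, label in gold if classify(text) == label)
    return correct / len(gold)

# Hypothetical stand-in for a trained classifier.
def toy_classify(text):
    return "pos" if "buena" in text else "neg"

gold = [
    ("una buena historia", "pos"),
    ("una mala historia", "neg"),
    ("no es buena", "neg"),      # the toy classifier misses the negation
    ("terrible", "neg"),
]
score = accuracy(toy_classify, gold)
```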

Tips

garbage in, garbage out
correct bad data
clean & scrub text
experiment with train_classifier.py options
create custom features
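
As an illustration of the "clean & scrub" and "custom features" tips, the sketch below normalises text and builds a bag-of-words feature dict of the kind NLTK classifiers accept. The cleaning rules and the `has_exclamation` feature are illustrative assumptions, not recommendations from the talk.

```python
import re
import unicodedata

def clean(text):
    """Strip HTML tags, normalise Unicode, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = unicodedata.normalize("NFKC", text)  # also expands ligatures
    return re.sub(r"\s+", " ", text).strip().lower()

def features(text):
    """Bag-of-words feature dict plus one custom feature."""
    words = re.findall(r"\w+", clean(text))
    feats = {word: True for word in words}
    feats["has_exclamation"] = "!" in text
    return feats

cleaned = clean("<p>Muy   BUENA\n pelicula!</p>")
feats = features("<p>Muy   BUENA\n pelicula!</p>")
```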

Bootstrapping a Phrase Extractor

1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done

NLTK Tagged Corpora

English: brown, conll2000, treebank
Portuguese: mac_morpho, floresta
Spanish: cess_esp, conll2002
Catalan: cess_cat
Dutch: alpino, conll2002
Indian Languages: indian
Chinese: sinica_treebank
see http://text-processing.com/demo/tag/

Train Tagger

$ train_tagger.py treebank --simplify_tags

Phrase Annotation

Hello world, [this is an important phrase].
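
The bracket notation is easy to process mechanically. A small sketch, assuming phrases are marked with non-nested square brackets as in the example line; `split_annotation` is a hypothetical helper, not part of nltk-trainer.

```python
import re

def split_annotation(line):
    """Return (plain_text, phrases) for a line with [bracketed] phrases."""
    phrases = re.findall(r"\[([^\[\]]+)\]", line)
    plain = re.sub(r"[\[\]]", "", line)
    return plain, phrases

plain, phrases = split_annotation("Hello world, [this is an important phrase].")
```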

Tag Phrases

$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle \
    --input my_phrases.txt

Chunked & Tagged Phrase

Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
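
To show how the chunked and tagged format carries both layers of annotation, this sketch converts such a line into (word, tag, IOB) triples. It assumes brackets are whitespace-separated tokens as in the example; `chunked_to_iob` and the `PHRASE` label are hypothetical names chosen for illustration.

```python
def chunked_to_iob(line, chunk_label="PHRASE"):
    """Convert 'word/TAG [ word/TAG ... ]' text into (word, tag, iob) triples."""
    triples = []
    inside = False  # currently between [ and ]
    begin = False   # the next token starts a chunk
    for token in line.split():
        if token == "[":
            inside, begin = True, True
        elif token == "]":
            inside = False
        else:
            # Split on the last slash so a word containing '/' survives.
            word, _, tag = token.rpartition("/")
            if inside:
                iob = ("B-" if begin else "I-") + chunk_label
                begin = False
            else:
                iob = "O"
            triples.append((word, tag, iob))
    return triples

line = "Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./."
triples = chunked_to_iob(line)
```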

Correct Unknown Words

1. find -NONE- tagged words
2. fix tags
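
Both steps can be partly scripted. A minimal sketch, assuming the `word/TAG` line format shown earlier in the deck; `find_untagged` and `fix_tags` are hypothetical helpers, and the replacement tags still have to come from a person.

```python
def find_untagged(line):
    """Return the words in a 'word/TAG' line whose tag is -NONE-."""
    untagged = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep and tag == "-NONE-":
            untagged.append(word)
    return untagged

def fix_tags(line, fixes):
    """Replace -NONE- tags using a hand-written {word: tag} mapping."""
    out = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep and tag == "-NONE-" and word in fixes:
            token = word + "/" + fixes[word]
        out.append(token)
    return " ".join(out)

# Made-up example line with one unknown word.
line = "Hello/N weotta/-NONE- ,/, [ this/DET is/V ] ./."
unknown = find_untagged(line)
fixed = fix_tags(line, {"weotta": "N"})
```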

Train New Tagger

$ train_tagger.py my_corpus \
    --reader nltk.corpus.reader.ChunkedCorpusReader

Train Chunker

$ train_chunker.py my_corpus \
    --reader nltk.corpus.reader.ChunkedCorpusReader

Extracting Phrases

import collections
import nltk.data
from nltk import tokenize
from nltk.tag import untag

tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # group the text of every chunk subtree under its chunk label
    d = collections.defaultdict(list)
    # Tree.label() is the NLTK 3 accessor; NLTK 2 used the .node attribute
    for sub in t.subtrees(lambda s: s.label() != 'S'):
        d[sub.label()].append(' '.join(untag(sub.leaves())))
    return d

# text holds the raw string to extract phrases from
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<class 'list'>, {'PHRASE_TAG': ['phrase']})

Final Tips

error correction is faster than manual annotation
find close-enough corpora
use nltk-trainer to experiment
iterate -> quality
no substitute for human judgement

Links

http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com
