Corpus Bootstrapping with NLTK

Presented at Strata 2012 Deep Data session.

  1. Corpus Bootstrapping with NLTK by Jacob Perkins
  2. Jacob Perkins: http://www.weotta.com, http://streamhacker.com, http://text-processing.com, https://github.com/japerk/nltk-trainer, @japerk
  3. Problem: you want to do NLProc; there are many proven supervised training algorithms; but you don’t have a training corpus
  4. Solution: make a custom training corpus
  5. Problems with Manual Annotation: takes time; requires expertise; expert time costs $$$
  6. Solution: Bootstrap (less time; less expertise; costs less; requires thinking & creativity)
  7. Corpus Bootstrapping at Weotta: review sentiment; keyword classification; phrase extraction & classification
  8. Bootstrapping Examples: English -> Spanish sentiment; phrase extraction
  9. Translating Sentiment: start with an English sentiment corpus & classifier; English -> Spanish -> Spanish
  10. English -> Spanish -> Spanish: 1. translate English examples to Spanish; 2. train classifier; 3. classify Spanish text into new corpus; 4. correct new corpus; 5. retrain classifier; 6. add to corpus & goto 4 until done
  11. Translate Corpus: $ translate_corpus.py movie_reviews --source english --target spanish
  12. Train Initial Classifier: $ train_classifier.py spanish_movie_reviews
  13. Create New Corpus: $ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle
  14. Manual Correction: 1. scan each file; 2. move incorrect examples to correct file
  15. Train New Classifier: $ train_classifier.py spanish_sentiment
  16. Adding to the Corpus: start with >90% probability; retrain; carefully decrease probability threshold
  17. Add more at a Lower Threshold: $ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
  18. When are you done? What level of accuracy do you need? Does your corpus reflect real text? How much time do you have?
  19. Tips: garbage in, garbage out; correct bad data; clean & scrub text; experiment with train_classifier.py options; create custom features
  20. Bootstrapping a Phrase Extractor: 1. find a POS-tagged corpus; 2. annotate raw text; 3. train POS tagger; 4. create POS-tagged & chunked corpus; 5. tag unknown words; 6. train POS tagger & chunker; 7. correct errors; 8. add to corpus, goto 5 until done
  21. NLTK Tagged Corpora: English: brown, conll2000, treebank; Portuguese: mac_morpho, floresta; Spanish: cess_esp, conll2002; Catalan: cess_cat; Dutch: alpino, conll2002; Indian Languages: indian; Chinese: sinica_treebank; see http://text-processing.com/demo/tag/
  22. Train Tagger: $ train_tagger.py treebank --simplify_tags
  23. Phrase Annotation: Hello world, [this is an important phrase].
  24. Tag Phrases: $ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt
  25. Chunked & Tagged Phrase: Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
  26. Correct Unknown Words: 1. find -NONE- tagged words; 2. fix tags
  27. Train New Tagger: $ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
  28. Train Chunker: $ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
  29. Extracting Phrases:

          import collections
          import nltk.data
          from nltk import tokenize
          from nltk.tag import untag

          # load the tagger and chunker trained on my_corpus
          tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
          chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

          def extract_phrases(t):
              # map each chunk label to the list of phrases found under it
              d = collections.defaultdict(list)
              # NLTK 3 renamed Tree.node to Tree.label(); the original slide used s.node
              for sub in t.subtrees(lambda s: s.label() != 'S'):
                  d[sub.label()].append(' '.join(untag(sub.leaves())))
              return d

          # text is assumed to hold the raw input string (not defined on the slide)
          sents = tokenize.sent_tokenize(text)
          words = tokenize.word_tokenize(sents[0])
          d = extract_phrases(chunker.parse(tagger.tag(words)))
          # d is a defaultdict of lists, e.g. {'PHRASE_TAG': ['phrase']}
  30. Final Tips: error correction is faster than manual annotation; find close enough corpora; use nltk-trainer to experiment; iterate -> quality; no substitute for human judgement
  31. Links: http://www.nltk.org, https://github.com/japerk/nltk-trainer, http://text-processing.com
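
The "start with >90% probability, then carefully decrease the threshold" idea from slides 16 and 17 can be sketched in plain Python. This is only an illustration of the selection logic, not how classify_to_corpus.py is implemented: `select_confident`, `toy_prob_fn`, and the hard-coded probabilities in `TOY_PROBS` are hypothetical names and values made up for the example. With a trained NLTK classifier, the probabilities would likely come from `classifier.prob_classify(features)` instead.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical type: any function mapping a text to {label: probability}.
ProbFn = Callable[[str], Dict[str, float]]


def select_confident(
    texts: Iterable[str], prob_fn: ProbFn, threshold: float
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split texts into (text, label) pairs at or above threshold, and the rest."""
    accepted: List[Tuple[str, str]] = []
    deferred: List[str] = []
    for text in texts:
        probs = prob_fn(text)
        label = max(probs, key=probs.get)  # most probable label
        if probs[label] >= threshold:
            accepted.append((text, label))
        else:
            deferred.append(text)
    return accepted, deferred


# Made-up classifier output, hard-coded for illustration only.
TOY_PROBS = {
    "excelente": {"pos": 0.95, "neg": 0.05},
    "muy malo": {"pos": 0.15, "neg": 0.85},
    "regular": {"pos": 0.55, "neg": 0.45},
}


def toy_prob_fn(text: str) -> Dict[str, float]:
    """Stand-in for a trained classifier."""
    return TOY_PROBS[text]


corpus: List[Tuple[str, str]] = []
pending = list(TOY_PROBS)
for threshold in (0.9, 0.8):
    accepted, pending = select_confident(pending, toy_prob_fn, threshold)
    # In a real loop, a human would correct `accepted` and the classifier
    # would be retrained here before lowering the threshold.
    corpus.extend(accepted)

print(corpus)
print(pending)
```

Examples that never clear the lowest threshold stay in `pending`, which makes them natural candidates for manual review.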
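
For the "find -NONE- tagged words" step on slide 26, a small parser for the bracketed word/TAG format shown on slide 25 may help. The sketch below assumes every token is written as word/TAG, that phrases are delimited by standalone `[` and `]` tokens, and that unknown words carry the literal tag `-NONE-`; `parse_chunked` and `unknown_words` are hypothetical helper names, and the second sample line is invented. NLTK's ChunkedCorpusReader is the usual way to read such files; this version only illustrates the format.

```python
from typing import List, Tuple

Token = Tuple[str, str, bool]  # (word, tag, inside_phrase)


def parse_chunked(line: str) -> List[Token]:
    """Parse a bracketed word/TAG line into (word, tag, inside_phrase) triples."""
    tokens: List[Token] = []
    in_phrase = False
    for item in line.split():
        if item == "[":
            in_phrase = True
        elif item == "]":
            in_phrase = False
        else:
            # Split on the last slash so that tokens like ",/," parse correctly.
            word, _, tag = item.rpartition("/")
            tokens.append((word, tag, in_phrase))
    return tokens


def unknown_words(tokens: List[Token]) -> List[str]:
    """Return the words whose tag is the literal string -NONE-."""
    return [word for word, tag, _ in tokens if tag == "-NONE-"]


line = "Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./."
tokens = parse_chunked(line)
phrase = " ".join(word for word, _, inside in tokens if inside)
print(phrase)

# Invented line with one untagged word, to show what needs fixing by hand.
needs_fixing = unknown_words(parse_chunked("[ Weotta/-NONE- rocks/V ] ./."))
print(needs_fixing)
```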