Corpus Bootstrapping with NLTK

Presented at Strata 2012 Deep Data session.

  1. Corpus Bootstrapping with NLTK, by Jacob Perkins
  2. Jacob Perkins @japerk
  3. Problem: you want to do NLProc; there are many proven supervised training algorithms; but you don’t have a training corpus
  4. Solution: make a custom training corpus
  5. Problems with Manual Annotation: takes time; requires expertise; expert time costs $$$
  6. Solution: Bootstrap. Less time; less expertise; costs less; requires thinking & creativity
  7. Corpus Bootstrapping at Weotta: review sentiment; keyword classification; phrase extraction & classification
  8. Bootstrapping Examples: english -> spanish sentiment; phrase extraction
  9. Translating Sentiment: start with english sentiment corpus & classifier; english -> spanish -> spanish
  10. English -> Spanish -> Spanish: 1. translate english examples to spanish; 2. train classifier; 3. classify spanish text into new corpus; 4. correct new corpus; 5. retrain classifier; 6. add to corpus & goto 4 until done
  11. Translate Corpus: $ movie_reviews --source english --target spanish
  12. Train Initial Classifier: $ spanish_movie_reviews
  13. Create New Corpus: $ spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle
  14. Manual Correction: 1. scan each file; 2. move incorrect examples to correct file
  15. Train New Classifier: $ spanish_sentiment
  16. Adding to the Corpus: start with >90% probability; retrain; carefully decrease probability threshold
  17. Add more at a Lower Threshold: $ categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
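
The threshold idea in slides 16 and 17 can be sketched directly in NLTK. This is only a minimal illustration, not the nltk-trainer implementation: the toy Spanish examples, the `bag_of_words` feature function, and the `add_confident` helper are hypothetical names invented for this sketch. It relies on `NaiveBayesClassifier.prob_classify`, which returns a probability distribution over labels.

```python
from nltk.classify import NaiveBayesClassifier

def bag_of_words(text):
    # Hypothetical feature extractor: one boolean feature per lowercased token.
    return {word: True for word in text.lower().split()}

def add_confident(classifier, corpus, candidates, threshold):
    """Append candidates labeled with probability >= threshold to the corpus.

    Returns the candidates that stayed below the threshold.
    """
    remaining = []
    for text in candidates:
        dist = classifier.prob_classify(bag_of_words(text))
        label = dist.max()
        if dist.prob(label) >= threshold:
            corpus.append((text, label))
        else:
            remaining.append(text)
    return remaining

# Toy seed corpus and unlabeled candidates (made up for illustration).
corpus = [
    ("excelente pelicula muy buena", "pos"),
    ("buena historia excelente final", "pos"),
    ("pelicula terrible muy mala", "neg"),
    ("mala historia terrible final", "neg"),
]
candidates = ["excelente buena", "un final terrible", "historia"]

# Start strict, retrain on the grown corpus, then lower the threshold.
for threshold in (0.9, 0.8):
    train_set = [(bag_of_words(text), label) for text, label in corpus]
    classifier = NaiveBayesClassifier.train(train_set)
    candidates = add_confident(classifier, corpus, candidates, threshold)
    # In a real run, manually correct the new examples before retraining.
```

Anything left in `candidates` after the last pass is text the classifier is still unsure about; those examples are the natural place to spend manual annotation time.
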
  18. When are you done? What level of accuracy do you need? Does your corpus reflect real text? How much time do you have?
  19. Tips: garbage in, garbage out; correct bad data; clean & scrub text; experiment with options; create custom features
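
As one possible reading of "clean & scrub text" and "create custom features", the sketch below strips leftover HTML tags and punctuation, then builds a feature dictionary of words plus adjacent word pairs. The function names and the specific cleaning rules are assumptions chosen for illustration; the dictionary shape (feature name mapped to a value) is the one NLTK classifiers accept.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")       # leftover HTML tags
NON_WORD_RE = re.compile(r"[^\w\s]")  # punctuation and symbols

def clean_text(text):
    """Remove tags and punctuation, lowercase, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text.lower())
    return " ".join(text.split())

def features(text):
    """Bag of words plus adjacent word pairs (bigrams) as boolean features."""
    words = clean_text(text).split()
    feats = {word: True for word in words}
    for first, second in zip(words, words[1:]):
        feats[first + " " + second] = True
    return feats
```

Bigram features let a classifier tell "not good" apart from "good", which a plain bag of words cannot do.
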
  20. Bootstrapping a Phrase Extractor: 1. find a pos tagged corpus; 2. annotate raw text; 3. train pos tagger; 4. create pos tagged & chunked corpus; 5. tag unknown words; 6. train pos tagger & chunker; 7. correct errors; 8. add to corpus, goto 5 until done
  21. NLTK Tagged Corpora: English: brown, conll2000, treebank; Portuguese: mac_morpho, floresta; Spanish: cess_esp, conll2002; Catalan: cess_cat; Dutch: alpino, conll2002; Indian Languages: indian; Chinese: sinica_treebank; see
  22. Train Tagger: $ treebank --simplify_tags
  23. Phrase Annotation: Hello world, [this is an important phrase].
  24. Tag Phrases: $ my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt
  25. Chunked & Tagged Phrase: Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
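
The bracketed word/tag format on slide 25 can be read back into a tree with `nltk.chunk.tagstr2tree`, which is handy for checking a corrected file by hand. The sketch below is an illustration only; the chunk label `PHRASE` is a hypothetical name chosen here, since the slides do not say which label the corpus uses.

```python
from nltk.chunk import tagstr2tree
from nltk.tag import untag

line = "Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./."

# "PHRASE" is a hypothetical chunk label chosen for this example.
tree = tagstr2tree(line, "PHRASE")

# Every (word, tag) pair, whether or not it sits inside a chunk.
tagged_words = tree.leaves()

# The plain text of each bracketed chunk.
phrases = [
    " ".join(untag(sub.leaves()))
    for sub in tree.subtrees(lambda s: s.label() == "PHRASE")
]
```
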
  26. Correct Unknown Words: 1. find -NONE- tagged words; 2. fix tags
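
Finding the `-NONE-` tagged words is easy to script. The helper below is a hypothetical sketch using only the standard library: it scans text in the word/tag format of slide 25 and counts the words whose tag is `-NONE-`, so the most frequent unknown words can be fixed first. The sample line is made up for illustration.

```python
from collections import Counter

def unknown_words(tagged_text, unknown_tag="-NONE-"):
    """Count words whose tag equals unknown_tag in word/tag formatted text."""
    counts = Counter()
    for token in tagged_text.split():
        if token in ("[", "]"):
            continue  # chunk brackets carry no tag
        word, sep, tag = token.rpartition("/")
        if sep and tag == unknown_tag:
            counts[word] += 1
    return counts

# Made-up sample in which the tagger did not know "awesome".
sample = "Hello/N world/N ,/, [ this/DET is/V an/DET awesome/-NONE- phrase/N ] ./."
todo = unknown_words(sample).most_common()
```
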
  27. Train New Tagger: $ my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
  28. Train Chunker: $ my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
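  29. Extracting Phrases:

          import collections
          import nltk.data
          from nltk import tokenize
          from nltk.tag import untag

          # The slide assigns a trained tagger and chunker here; the
          # right-hand sides did not survive in this transcript.
          tagger = ...
          chunker = ...

          def extract_phrases(t):
              """Map each chunk label to the list of phrases found under it."""
              d = collections.defaultdict(list)
              # Skip the root 'S' node. NLTK 3 uses label(); NLTK 2 used .node.
              for sub in t.subtrees(lambda s: s.label() != 'S'):
                  d[sub.label()].append(' '.join(untag(sub.leaves())))
              return d

          # text is the raw input string to extract phrases from
          sents = tokenize.sent_tokenize(text)
          words = tokenize.word_tokenize(sents[0])
          d = extract_phrases(chunker.parse(tagger.tag(words)))
          # defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})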
  30. Final Tips: error correction is faster than manual annotation; find close enough corpora; use nltk-trainer to experiment; iterate -> quality; no substitute for human judgement
  31. Links: http://www.nltk.org