10. English -> Spanish -> Spanish
1. translate English examples to Spanish
2. train classifier
3. classify new Spanish text into a new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
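Steps 2-3 can be sketched with NLTK's NaiveBayesClassifier. The training pairs and the bag-of-words featurizer below are illustrative stand-ins for real translated examples:

```python
from nltk.classify import NaiveBayesClassifier

def bag_of_words(words):
    # NLTK classifiers take dicts of feature name -> feature value
    return {w: True for w in words}

# stand-in for English examples machine-translated to Spanish (step 1)
translated = [
    ('esta pelicula es buena'.split(), 'pos'),
    ('una historia buena y divertida'.split(), 'pos'),
    ('esta pelicula es mala'.split(), 'neg'),
    ('una trama mala y aburrida'.split(), 'neg'),
]

# step 2: train a classifier on the translated examples
train_set = [(bag_of_words(words), label) for words, label in translated]
classifier = NaiveBayesClassifier.train(train_set)

# step 3: classify previously unseen Spanish text
label = classifier.classify(bag_of_words('que historia tan buena'.split()))
```

The classified text then seeds the new Spanish corpus, which you correct by hand (step 4) before retraining.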
16. Adding to the Corpus
start with >90% probability
retrain
carefully decrease probability threshold
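NLTK's NaiveBayesClassifier exposes per-label probabilities through prob_classify(), so the threshold check might look like this (the tiny classifier and helper name are illustrative):

```python
from nltk.classify import NaiveBayesClassifier

def bag_of_words(words):
    return {w: True for w in words}

train_set = [
    (bag_of_words('esta pelicula es buena'.split()), 'pos'),
    (bag_of_words('esta pelicula es mala'.split()), 'neg'),
]
classifier = NaiveBayesClassifier.train(train_set)

def classify_if_confident(classifier, words, threshold=0.9):
    # accept an example into the corpus only when the classifier is confident
    probdist = classifier.prob_classify(bag_of_words(words))
    label = probdist.max()
    prob = probdist.prob(label)
    return (label, prob) if prob >= threshold else None
```

Start near threshold=0.9, retrain on the accepted examples, then lower the threshold on later passes.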
17. Add more at a Lower Threshold
$ classify_to_corpus.py categorized_corpus \
    --classifier categorized_corpus_NaiveBayes.pickle \
    --threshold 0.8 --input new_examples.txt
18. When are you done?
what level of accuracy do you need?
does your corpus reflect real text?
how much time do you have?
19. Tips
garbage in, garbage out
correct bad data
clean & scrub text
experiment with train_classifier.py options
create custom features
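As one example of a custom feature set, word bigrams can be layered on top of the plain bag-of-words features (a sketch; nltk-trainer's train_classifier.py exposes options along these lines):

```python
def bag_of_words(words):
    # the default feature set: every word becomes a True-valued feature
    return {w: True for w in words}

def bag_of_bigram_words(words):
    # custom feature set: add word-pair features to capture local context
    feats = bag_of_words(words)
    for bigram in zip(words, words[1:]):
        feats[bigram] = True
    return feats
```

Bigram features let the classifier distinguish e.g. "no bueno" from "bueno" alone, at the cost of a sparser feature space.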
20. Bootstrapping a Phrase Extractor
1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done
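Steps 3 and 5-6 can be sketched with NLTK's sequential backoff taggers. The two-sentence training corpus and the tag names below are toy stand-ins for a real POS-tagged corpus:

```python
import pickle

from nltk.tag import DefaultTagger, UnigramTagger

# toy stand-in for a real POS-tagged corpus (a list of tagged sentences)
train_sents = [
    [('el', 'DA'), ('gato', 'NC'), ('duerme', 'VM')],
    [('la', 'DA'), ('casa', 'NC'), ('es', 'VS'), ('grande', 'AQ')],
]

# unknown words fall back to a default tag (step 5: tag unknown words)
backoff = DefaultTagger('NC')
tagger = UnigramTagger(train_sents, backoff=backoff)

tagged = tagger.tag(['el', 'perro', 'duerme'])

# pickle the tagger; saved under a taggers/ path it can later be
# reloaded with nltk.data.load(), as in the extraction code below
data = pickle.dumps(tagger)
```

Correct the fallback tags by hand (step 7), add the sentences to the corpus, and retrain until the tagger handles your text well enough.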
21. NLTK Tagged Corpora
English: brown, conll2000, treebank
Portuguese: mac_morpho, floresta
Spanish: cess_esp, conll2002
Catalan: cess_cat
Dutch: alpino, conll2002
Indian Languages: indian
Chinese: sinica_treebank
see http://text-processing.com/demo/tag/
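All of these ship with NLTK's data distribution and load the same way once downloaded. The mapping below just restates the slide's list:

```python
# tagged corpora bundled with NLTK, keyed by language (as listed above)
TAGGED_CORPORA = {
    'English': ['brown', 'conll2000', 'treebank'],
    'Portuguese': ['mac_morpho', 'floresta'],
    'Spanish': ['cess_esp', 'conll2002'],
    'Catalan': ['cess_cat'],
    'Dutch': ['alpino', 'conll2002'],
    'Indian languages': ['indian'],
    'Chinese': ['sinica_treebank'],
}

# each is loaded the same way after downloading, e.g.:
#   import nltk
#   nltk.download('brown')
#   tagged = nltk.corpus.brown.tagged_sents()
```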
29. Extracting Phrases
import collections
import nltk.data
from nltk import tokenize
from nltk.tag import untag

tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # collect each chunk's words, grouped by chunk tag; skip the top-level S
    d = collections.defaultdict(list)
    for sub in t.subtrees(lambda s: s.label() != 'S'):  # use s.node on NLTK 2.x
        d[sub.label()].append(' '.join(untag(sub.leaves())))
    return d

# text is the raw input string to extract phrases from
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<class 'list'>, {'PHRASE_TAG': ['phrase']})
30. Final Tips
error correction is faster than manual annotation
find close enough corpora
use nltk-trainer to experiment
iterate -> quality
no substitute for human judgement