5. nsmc + crawling
Tokenizing/
POS tagging
KoNLPy (Twitter)
Word
Embedding
Word2Vec
Corpus
/
/ /
Corpus
Most similar words
165,810
73,614
40,501
20,390
20,701
10,604
*
‘ ’
Naver Sentiment Movie Corpus https://github.com/e9t/nsmc
MySQL
6. def tokenize(doc):
return ['/'.join(t) for t in twitter.pos(doc, norm=True, stem=True)]
tokenize(‘ , !’)
>>>
[‘ /Noun',
',/Punctuation',
' /Noun',
' /Noun',
' /Adjective',
'!/Punctuation']
KoNLPy twitter tokenizing/pos tagging
7. Tf-Idf / Multinomial Naive Bayes
:
clf = Pipeline([
('vect', TfidfVectorizer(min_df=10, ngram_range=(1, 3))),
('clf', MultinomialNB(alpha=0.001)),
])
model = clf.fit(X_train, y_train)
model
Performance
Accuracy Recall F1
Limitation: 5
( )