한국어와 NLTK, Gensim의 만남 (When Korean meets NLTK and Gensim)

Eunjeong (Lucy) Park, Data Scientist
PyCon Korea 2015
2015-08-29
1 / 69
KoNLPy
,
? !
NLTK
,
Gensim
, .
bag-of-words, document embedding
, KoNLPy, NLTK, Gensim
.
KoNLPy: http://konlpy.org
NLTK: http://nltk.org
Gensim: http://radimrehurek.com/gensim
2 / 69
(a.k.a. lucypark,
echojuliett, e9t)
.
CAVEAT: KoNLPy
Just another yak shaver...
11:49 <@sanxiyn> once again, the mysterious world of yak shaving
11:51 <@sanxiyn> everyone knows what yak shaving is, right?
11:51 <디토군> just went and looked it up
11:51 <@mana> (quietly hoping for an explanation)
11:51 <@sanxiyn> you set out to cut down a tree
11:52 <@sanxiyn> and while swinging the axe
11:52 <@sanxiyn> you figure the tree would come down more easily if the axe were sharper
11:52 <@sanxiyn> so while putting an edge on the axe
11:52 <@sanxiyn> you figure a better grindstone would sharpen the blade faster
11:52 <@sanxiyn> so you ask around for where to find a good whetstone
11:52 <@mana> …
11:52 <&홍민희> that is exactly how I behave
11:52 <@sanxiyn> you hear the world's best whetstones come from some faraway place
11:52 <@sanxiyn> and while trying to ride a yak all the way there
11:52 <@mana> I couldn't type because that is exactly what I always do
11:52 <@sanxiyn> you end up shaving the yak's fur…
11:52 <@sanxiyn> etc.
3 / 69
/
KoNLPy, NLTK, Gensim (?)
!
Gensim ,
.
,
.
.
?
3
4 / 69
1997
20 ...
5 / 69
IBM .
6 / 69
. ,
.
.
-- Bengio et al., Deep Learning, Chapter 1, Book in preparation for MIT Press, 2015.
7 / 69
" "
Representations of text
8 / 69
?
, .
" " .
9 / 69
{ , }
≈
⊂
10 / 69
, ,
(binary)
set
one-hot vector
1-of-V
Each word is mapped to a vector with a 1 in its own position and 0 everywhere else:

w_1 = [1, 0, 0, …, 0]⊤    w_2 = [0, 1, 0, …, 0]⊤    w_3 = [0, 0, 1, …, 0]⊤    …    w_|V| = [0, 0, 0, …, 1]⊤

i.e., w_i has a 1 only at index i, and every vector has dimension |V| (the vocabulary size).
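A minimal sketch of the one-hot encoding above, using a made-up toy vocabulary (the words here are illustrative, not from the slides):

vocab = ["dog", "cat", "movie", "actor"]           # |V| = 4
word_index = {w: i for i, w in enumerate(vocab)}   # word -> position i

def one_hot(word):
    # |V|-dimensional vector with a 1 only at the word's own index
    vector = [0] * len(vocab)
    vector[word_index[word]] = 1
    return vector

print(one_hot("cat"))    # => [0, 1, 0, 0]
print(one_hot("movie"))  # => [0, 0, 1, 0]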
11 / 69
One-hot vectors cannot express similarity between words:
ex: the dot product between any two different word vectors is 0, no matter how related the words are.
(w_i)⊤ w_j = 0 for all i ≠ j
* Discrete representations like BOW are also called local representations, in contrast to distributed representations. 12 / 69
,
bag of words (BOW)
whether each word appears in the document or not (call it "term existence")
how many times each word appears (a.k.a. term frequency (TF))
weighted variants, ex: term frequency–inverse document frequency (TF-IDF)
d1(binary) = [1, 1, 0, …, 0]⊤        d1(tf) = [3, 1, 0, …, 0]⊤

The same document d1 under the binary (term existence) encoding and the term frequency encoding; both vectors have dimension |V|.
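A minimal sketch of both encodings, using a made-up toy vocabulary and document (not from the slides):

vocab = ["movie", "actor", "boring", "great"]
d1 = ["movie", "movie", "movie", "actor"]

binary_vector = [1 if word in d1 else 0 for word in vocab]
tf_vector = [d1.count(word) for word in vocab]

print(binary_vector)  # => [1, 1, 0, 0]
print(tf_vector)      # => [3, 1, 0, 0]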
13 / 69
.*
.
.
" ".
* Source: Vincent Spruyt, The Curse of Dimensionality in classification, 2014. 14 / 69
" ?"
(WordNet) X is-a Y (hypernym)
(DAG; directed acyclic graph) . ( : .)
from nltk.corpus import wordnet as wn
wn.synsets('computer') # "computer"
#=> [Synset('computer.n.01'), Synset('calculator.n.01')]
: ( ) /
15 / 69
:
1. , .
from nltk.corpus import wordnet as wn
wn.synsets('nginx')
#=> []
2. .
wn.synsets('yolo') # : you only live once
#=> []
3. NLTK ... ( !)
wn.synsets(' ', lang='kor')
#=> WordNetError: Language is not supported.
16 / 69
.. . ?
17 / 69
.
18 / 69
.
You shall know a word by the company it keeps.
-- J.R. Firth (1957)
19 / 69
1: ?
__ .
, __ .
__ .
: [ ]
20 / 69
2: " " ?
?
21 / 69
.
?
SO .
22 / 69
;;
23 / 69
, (context)
.
(context) := (window) /
1-8 , 3-17 *
" "
ex: "PyCon" "R" "Python" ?
* Jurafsky, Martin, Speech and Language Processing 3rd Ed. 19.1 , Book in preparation, 2015. 24 / 69
"Co-occurrence"
25 / 69
Co-occurrence
documents = [[" ", " ", " ", " "],
[" ", "R", " ", " "]]
1. Term-document matrix ( ):
Computation (!)
x_td = [[1, 1], #
[1, 0], #
[0, 1], # R
[1, 1], #
[1, 1]] #
2. Term-term matrix ( ):
syntactic, semantic
(window size == 5):
x_tt = [[0, 1, 1, 2, 0], #
[1, 0, 0, 1, 1], #
[1, 0, 0, 1, 1], # R
[2, 1, 1, 0, 2], #
[0, 1, 1, 2, 0]] #
X_td ∈ ℝ^(|V|×M), where M is the number of documents;    X_tt ∈ ℝ^(|V|×|V|)
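A minimal sketch of how the two matrices could be built from tokenized documents. The helper names are mine, and the slide's exact co-occurrence counting convention is not recoverable here, so the term-term counts assume a simple symmetric word window:

def term_document_matrix(documents, vocab):
    # X_td[i][j] = how many times vocab[i] appears in documents[j]
    return [[doc.count(word) for doc in documents] for word in vocab]

def term_term_matrix(documents, vocab, window=5):
    # X_tt[i][j] += 1 whenever vocab[j] appears within `window` words of vocab[i]
    index = {w: k for k, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in documents:
        for pos, word in enumerate(doc):
            if word not in index:
                continue
            neighbors = doc[max(0, pos - window):pos] + doc[pos + 1:pos + 1 + window]
            for neighbor in neighbors:
                if neighbor in index:
                    matrix[index[word]][index[neighbor]] += 1
    return matrix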
26 / 69
Co-occurrence matrix
.
import math

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return prod / (len1 * len2)
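For example, two rows of the x_tt matrix from the previous slide can be compared directly (which terms the rows stand for is an assumption here, since the original row labels were Korean words):

v_python = x_tt[1]   # row assumed to be "Python"
v_r = x_tt[2]        # row labeled "R" on the previous slide
print(cosine_measure(v_python, v_r))   # => 1.0 (identical context rows in this tiny example)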
:
skewed ( , )
discriminative (ex: list(' '))
?
27 / 69
Neural embeddings
: !
(language model)
ex: [" ", " ", " ", " "] ? (ex: "!", "." "?")
, .
Neural network language model (NNLM)*
* Bengio, Ducharme, Vincent, A Neural Probabilistic Language Model, {NIPS, 2000; JMLR, 2003}. 28 / 69
word2vec *
T. Mikolov word2vec neural embedding
! (n -> m )
* Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their
Compositionality. In Proceedings of NIPS, 2013. 29 / 69
: word2vec *
* , , , (1946-2015) , 49 2 , 2015. 30 / 69
neural embedding ?
31 / 69
doc2vec *
: ( ) !
.
.
|V|
* Le, Mikolov, Distributed Representations of Sentences and Documents, ICML, 2014. (The paper calls this "paragraph vector" rather than doc2vec.) 32 / 69
Example: with term frequency, every document becomes a |V|-dimensional vector.
documents = [
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' ']
]
words = set(sum(documents, []))
print(words)
# => {' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '}
def term_frequency(document):
    # Count how often each vocabulary word appears in the document (fixed word order)
    return [document.count(word) for word in sorted(words)]
vectors = [term_frequency(doc) for doc in documents]
print(vectors)
# => {
# 'S1': [1, 1, 0, 1, 0, 0, 1, 1],
# 'S2': [1, 1, 0, 1, 0, 0, 1, 1],
# 'S3': [1, 0, 0, 1, 1, 1, 1, 0],
# 'S4': [1, 1, 0, 1, 1, 1, 0, 0],
# 'S5': [1, 0, 1, 1, 1, 0, 1, 0],
# 'S6': [1, 1, 1, 1, 1, 0, 0, 0]
# }
|V|
33 / 69
Example: with doc2vec, every document becomes a short dense vector instead.
documents = [
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' ']
]
words = set(sum(documents, []))
print(words)
# => {' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '}
def doc2vec(document):
    # do something else (use Gensim here!)
    return vector
vectors = [doc2vec(doc) for doc in documents]
print(vectors)
# => {
# 'S1': [-0.01599941, 0.02135301, 0.0927715],
# 'S2': [0.02333261, 0.00211833, 0.00254255],
# 'S3': [-0.00117272, -0.02043504, 0.00593186],
# 'S4': [0.0089237, -0.00397806, -0.13199195],
# 'S5': [0.02555059, 0.01561624, 0.03603476],
# 'S6': [-0.02114407, -0.01552016, 0.01289922]
# }
|V|
34 / 69
.
:
Sparse, long vectors:  1-hot-vectors; bag-of-words (term existence, TF, TF-IDF, etc.)
Dense, short vectors:  word2vec; doc2vec
.
35 / 69
Sentiment analysis for Korean movie reviews
36 / 69
, ?
37 / 69
!
38 / 69
Naver sentiment movie corpus v1.0
A Korean counterpart to the IMDB review dataset of Maas et al., 2011
Specs:
Short movie reviews of at most 140 characters each
200K sampled reviews in total
ratings_train.txt: 150K reviews, ratings_test.txt: 50K reviews
Equal numbers of positive and negative reviews (i.e., random guess yields 50% accuracy)
Size: 19MB
URL: http://github.com/e9t/nsmc/
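A quick way to fetch the two files, assuming they sit at the repository root (the raw URLs below are an assumption based on the repository layout, not taken from the slides):

import urllib.request

base = 'https://raw.githubusercontent.com/e9t/nsmc/master/'
for name in ['ratings_train.txt', 'ratings_test.txt']:
    urllib.request.urlretrieve(base + name, name)   # saves the file next to this script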
39 / 69
README coming soon
40 / 69
$ cat ratings_train.txt | head -n 25
id document label
9976970 .. 0
3819312 ... .... 1
10265843 0
9045019 .. .. 0
6483659 !
5403919 3 1 8 . ... . 0
7797314 . 0
9443947 ..
7156791 1
5912145 ? .. ?
9008700 . ♥ 1
10217543 90 !! ~
5957425 0
8628627 . . .
9864035
6852435 1
9143163 .
4891476 0
7465483 !!♥
3989148 , . . 1
4581211 . 1
2718894 1
9705777 . ....
471131 . 1
41 / 69
, ?
1. Preprocess the data with KoNLPy
2. Explore the data with NLTK
3. Classify with bag-of-words (term existence) features
4. Classify with Gensim's doc2vec
42 / 69
1. Data preprocessing (feat. KoNLPy)
def read_data(filename):
    with open(filename, 'r') as f:
        data = [line.split('\t') for line in f.read().splitlines()]
        data = data[1:]   # skip the header row
    return data

train_data = read_data('ratings_train.txt')
test_data = read_data('ratings_test.txt')
NOTE: If you prefer pandas.read_csv(), set its quoting parameter to csv.QUOTE_NONE (3), since some reviews contain quote characters that would otherwise trip up the parser (a sketch follows below).
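A minimal sketch of that pandas alternative, assuming the same ratings_train.txt file (it mirrors what read_data() above returns, minus the list-of-lists shape):

import csv
import pandas as pd

# quoting=csv.QUOTE_NONE keeps stray quote characters inside reviews from confusing the parser
train_df = pd.read_csv('ratings_train.txt', sep='\t', quoting=csv.QUOTE_NONE)
print(train_df.shape)   # => (150000, 3): id, document, label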
43 / 69
1. Data preprocessing (feat. KoNLPy)
# row, column
print(len(train_data)) # nrows: 150000
print(len(train_data[0])) # ncols: 3
print(len(test_data)) # nrows: 50000
print(len(test_data[0])) # ncols: 3
44 / 69
1. Data preprocessing (feat. KoNLPy)
from konlpy.tag import Twitter
pos_tagger = Twitter()
def tokenize(doc):
    # norm and stem are optional
    return ['/'.join(t) for t in pos_tagger.pos(doc, norm=True, stem=True)]
train_docs = [(tokenize(row[1]), row[2]) for row in train_data]
test_docs = [(tokenize(row[1]), row[2]) for row in test_data]
#
from pprint import pprint
pprint(train_docs[0])
# => [([' /Exclamation',
# ' /Noun',
# '../Punctuation',
# ' /Noun',
# ' /Noun',
# ' /Verb',
# ' /Noun'],
# '0')]
NOTE:
. 45 / 69
1. Data preprocessing (feat. KoNLPy)
" ?"
, x
*
" (POS) ?"
ex: ' /Josa' vs ' /Noun'
* https://github.com/karpathy/char-rnn 46 / 69
2. Data exploration (feat. NLTK)
!
Collect every token in the training data:
tokens = [t for d in train_docs for t in d[0]]
print(len(tokens))
# => 2194536
47 / 69
2. Data exploration (feat. NLTK)
nltk.Text() ! (Exploration )
document ,
.
import nltk
text = nltk.Text(tokens, name='NMSC')
print(text)
# => <Text: NMSC>
48 / 69
2. Data exploration (feat. NLTK)
Count tokens
print(len(text.tokens)) # returns number of tokens
# => 2194536
print(len(set(text.tokens))) # returns number of unique tokens
# => 48765
pprint(text.vocab().most_common(10)) # returns frequency distribution
# => [('./Punctuation', 68630),
# (' /Noun', 51365),
# (' /Verb', 50281),
# (' /Josa', 39123),
# (' /Verb', 34764),
# (' /Josa', 30480),
# ('../Punctuation', 29055),
# (' /Josa', 27108),
# (' /Josa', 26696),
# (' /Josa', 23481)]
49 / 69
2. Data exploration (feat. NLTK)
Plot tokens by frequency
text.plot(50) # Plot sorted frequency of top 50 tokens
;;
50 / 69
2. Data exploration (feat. NLTK)
!
Troubleshooting: For those who see rectangles instead of letters in the saved plot file,
include the following configurations before drawing the plot:
Some example fonts:
Mac OS: /Library/Fonts/AppleGothic.ttf
from matplotlib import font_manager, rc
font_fname = 'c:/windows/fonts/gulim.ttc' # A font of your choice
font_name = font_manager.FontProperties(fname=font_fname).get_name()
rc('font', family=font_name)
51 / 69
2. Data exploration (feat. NLTK)
. .
.
52 / 69
2. Data exploration (feat. NLTK)
Collocations: pairs of words that frequently appear together (ex: "text" + "mining")
text.collocations() #
# => /Determiner /Noun; /Suffix /Josa; /Determiner /Noun; /Noun
# /Verb; /Noun /Josa; 10/Number /Noun; /Noun /Suffix; /Noun
# /Adjective; /Noun /Josa; /Noun /Josa; /Noun /Josa; /Suffix
# /Josa; /Noun /Noun; /Noun /Josa; /Suffix /Josa; /Noun
# /Josa; /Noun /Josa; /Suffix /Josa; /Noun /Suffix; /Noun
# /Josa
NOTE: For the other features of nltk.Text(), refer to the source code or the API documentation.
53 / 69
3. Sentiment classification with term-existence*
Use the presence or absence of selected terms as features.
# Use the 2000 most frequent tokens as features
# WARNING: written for clarity, not for time/memory efficiency
selected_words = [f[0] for f in text.vocab().most_common(2000)]
def term_exists(doc):
    return {'exists({})'.format(word): (word in set(doc)) for word in selected_words}

# Use only part of the training corpus here to keep the run time down
train_docs = train_docs[:10000]
train_xy = [(term_exists(d), c) for d, c in train_docs]
test_xy = [(term_exists(d), c) for d, c in test_docs]
Try this:
What happens if you select only 50 terms instead of 2000?
What about TF or TF-IDF weights instead of term existence?
Use a sparse matrix, or scikit-learn's TfidfVectorizer()! (See the sketch below.)
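A minimal sketch of the TfidfVectorizer() route, reusing the tokenized documents from the preprocessing step (the parameter choices are illustrative, not from the slides):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Each document is already a list of "token/POS" strings, so bypass sklearn's own tokenizer
vectorizer = TfidfVectorizer(analyzer=lambda doc: doc, max_features=2000)
train_tfidf = vectorizer.fit_transform([d for d, c in train_docs])   # sparse matrix
test_tfidf = vectorizer.transform([d for d, c in test_docs])

clf = LogisticRegression()
clf.fit(train_tfidf, [c for d, c in train_docs])
print(clf.score(test_tfidf, [c for d, c in test_docs]))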
* Natural Language Processing with Python Chapter 6. Learning to classify text 54 / 69
3. Sentiment classification with term-existence
Now let's train a classifier on these features.
NLTK ships with several classifiers:
Naive Bayes Classifiers
Decision Tree Classifiers
Maximum Entropy Classifiers
For a wider selection, look at scikit-learn.
Here, let's try NLTK's Naive Bayes Classifier!
55 / 69
3. Sentiment classification with term-existence
Train and evaluate a Naive Bayes classifier:
# Train the classifier
classifier = nltk.NaiveBayesClassifier.train(train_xy)
print(nltk.classify.accuracy(classifier, test_xy))
# => 0.80418
classifier.show_most_informative_features(10)
# => Most Informative Features
# exists( /Noun) = True 1 : 0 = 38.0 : 1.0
# exists( /Noun) = True 0 : 1 = 30.1 : 1.0
# exists(♥/Foreign) = True 1 : 0 = 24.5 : 1.0
# exists( /Noun) = True 0 : 1 = 22.1 : 1.0
# exists( /Noun) = True 0 : 1 = 19.5 : 1.0
# exists( /Noun) = True 0 : 1 = 19.4 : 1.0
# exists( /Noun) = True 1 : 0 = 18.9 : 1.0
# exists( /Noun) = True 0 : 1 = 16.9 : 1.0
# exists( /Noun) = True 1 : 0 = 16.9 : 1.0
# exists( /Noun) = True 1 : 0 = 15.9 : 1.0
The positive/negative baseline accuracy is 0.50,
yet with only 10,000 training documents we already reach an accuracy of 0.80.
Can we do better?
56 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
This time, let's learn document vectors with doc2vec and classify with those.
We use Gensim's implementation.*
Author: Radim Řehůřek
* The code here is based on Gensim 0.12.0. Mind your version! 57 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
!!!
Gensim ...
58 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
doc2vec expects each training document to carry a tag.
Using `from gensim.models.doc2vec import TaggedDocument` is OK, but a plain namedtuple works just as well:
from collections import namedtuple
TaggedDocument = namedtuple('TaggedDocument', 'words tags')
# Tag the 150K training documents
tagged_train_docs = [TaggedDocument(d, [c]) for d, c in train_docs]
tagged_test_docs = [TaggedDocument(d, [c]) for d, c in test_docs]
59 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
from gensim.models import doc2vec
# Build the model
doc_vectorizer = doc2vec.Doc2Vec(size=300, alpha=0.025, min_alpha=0.025, seed=1234)
doc_vectorizer.build_vocab(tagged_train_docs)

# Train document vectors!
for epoch in range(10):
    doc_vectorizer.train(tagged_train_docs)
    doc_vectorizer.alpha -= 0.002                      # decrease the learning rate
    doc_vectorizer.min_alpha = doc_vectorizer.alpha    # fix the learning rate, no decay

# To save the trained model:
# doc_vectorizer.save('doc2vec.model')

Try this:
Experiment with the vector size and the number of epochs.
60 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
! (1/3)
pprint(doc_vectorizer.most_similar(' /Noun'))
# => [(' /Noun', 0.5669919848442078),
# (' /Noun', 0.5522832274436951),
# (' /Noun', 0.5021427869796753),
# (' /Noun', 0.5000861287117004),
# (' /Noun', 0.4368450343608856),
# (' /Noun', 0.42848479747772217),
# (' /Noun', 0.42714330554008484),
# (' /Noun', 0.41590073704719543),
# (' /Noun', 0.41056352853775024),
# (' /Noun', 0.4052993059158325)]
... .
61 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
! (2/3)
pprint(doc_vectorizer.most_similar(' /KoreanParticle'))
# => [(' /KoreanParticle', 0.5768033862113953),
# (' /KoreanParticle', 0.4822016954421997),
# ('!!!!/Punctuation', 0.4395076632499695),
# ('!!!/Punctuation', 0.4077949523925781),
# ('!!/Punctuation', 0.4026390314102173),
# ('~/Punctuation', 0.40038347244262695),
# ('~~/Punctuation', 0.39946430921554565),
# ('!/Punctuation', 0.3899948000907898),
# ('^^/Punctuation', 0.3852730989456177),
# ('~^^/Punctuation', 0.3797937035560608)]
! .
62 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
, v(KING) – v(MAN) + v(WOMAN) = v(QUEEN)
?
63 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
! (3/3)
pprint(doc_vectorizer.most_similar(positive=[' /Noun', ' /Noun'], negative=[' /Noun']))
# => [(' /Noun', 0.32974398136138916),
# (' /Noun', 0.305545836687088),
# (' /Noun', 0.2899821400642395),
# (' /Noun', 0.2856029272079468),
# (' /Noun', 0.2840743064880371),
# (' /Noun', 0.28247544169425964),
# (' /Noun', 0.2795347571372986),
# (' /Noun', 0.2794691324234009),
# (' /Noun', 0.27567577362060547),
# (' /Noun', 0.2734225392341614)]
...? ?
64 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
, ;;
, . ? *
text.concordance(' /Noun', lines=10)
# => Displaying 10 of 145 matches:
# Josa /Noun /Josa ,,/Punctuation /Noun /Noun ...../Punctuation /Noun
# /Noun /Noun ../Punctuation /Noun /Noun /Noun /Noun /Noun /Josa /Nou
# nction /Noun /Josa /Adjective /Noun /Verb /Verb /Noun /Josa
# /Noun /Noun /Noun ?/Punctuation /Noun /Noun ./Punctuation /Noun /No
# b /Noun /Noun /KoreanParticle /Noun /Noun /Adjective ./Punctuation
# osa /Noun /Josa /Noun /Josa /Noun /Josa /Noun /Noun /Josa /Nou
# /Noun /Josa /Noun ./Punctuation /Noun ,/Punctuation /Noun /Noun /Suf
# Josa /Noun /Noun /Noun /Noun /Noun /Josa /Noun /Suffix /Josa /
# /Verb ./Punctuation /Adjective /Noun /Noun .../Punctuation /Noun
# tive /Noun /Noun .../Punctuation /Noun /Noun ../Punctuation /Noun /J
* Huang, Socher, Improving Word Representations via Global Context and Multiple Word Prototypes, ACL, 2012. 65 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
. .
train_x = [doc_vectorizer.infer_vector(doc.words) for doc in tagged_train_docs]
train_y = [doc.tags[0] for doc in tagged_train_docs]
len(train_x)       # same number of documents as with term existence
# => 150000
len(train_x[0])
# => 300
test_x = [doc_vectorizer.infer_vector(doc.words) for doc in tagged_test_docs]
test_y = [doc.tags[0] for doc in tagged_test_docs]
len(test_x)
# => 50000
len(test_x[0])
# => 300
66 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
scikit-learn LogisticRegression() .
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=1234)
classifier.fit(train_x, train_y)
classifier.score(test_x, test_y)
# => 0.78246000000000004
Accuracy comes out around 0.78.
That is about 2 points below the term-existence result,
but note that term existence relied on 2,000 sparse features per document,
while doc2vec gets there with only 300 dense dimensions!
67 / 69
Wrapping up:
We preprocessed Korean text with KoNLPy, explored it with NLTK, and embedded documents with Gensim.
We compared term-existence features against doc2vec document vectors.
Try applying the same pipeline to your own task!
/ *
/
!
* , , , : , SNUDM Technical Reports, Apr 14,
2015. 68 / 69
:D
http://lucypark.kr
@echojuliett
69 / 69