# 한국어와 NLTK, Gensim의 만남

HTML: http://lucypark.kr/slides/2015-pyconkr 부제: 영화 리뷰를 컴퓨터가 이해할 수 있는 형식으로 표현해서 센티멘트 분석하기

1 of 69

## Recommended

내가 이해하는 SVM(왜, 어떻게를 중심으로)
내가 이해하는 SVM(왜, 어떻게를 중심으로)SANG WON PARK

Ranking algorithms
Ranking algorithmsAnkit Raj

5.3 dynamic programming
5.3 dynamic programmingKrish_ver2

Cubic Spline Interpolation
Cubic Spline InterpolationVARUN KUMAR

## What's hot

Dynamic programming
Dynamic programmingAmit Rathi

P, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-HardAnimesh Chaturvedi

Gram-Schmidt and QR Decomposition (Factorization) of Matrices
Gram-Schmidt and QR Decomposition (Factorization) of MatricesIsaac Yowetu

Ai 그까이거
Ai 그까이거도형 임

Machine Learning lecture2(linear regression)
Machine Learning lecture2(linear regression)cairo university

Differential Equation and its Application in LR circuit
Differential Equation and its Application in LR circuitKrushnaNemade

QR Algorithm Presentation
QR Algorithm Presentationkmwangi

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)Anmol Dwivedi

Euler and improved euler method
Euler and improved euler methodSohaib Butt

Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Simplilearn

3. Linear Algebra for Machine Learning: Factorization and Linear Transformations
3. Linear Algebra for Machine Learning: Factorization and Linear TransformationsCeni Babaoglu, PhD

Lu decomposition
Lu decompositiongilandio

[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)Donghyeon Kim

Gamma beta functions-1
Gamma beta functions-1Selvaraj John

Numerical analysis (Bisectional method) application
Numerical analysis (Bisectional method) applicationMonsur Ahmed Shafiq

Support Vector Machine and Implementation using Weka
Support Vector Machine and Implementation using WekaMacha Pujitha

### What's hot(20)

Runge kutta 2nd Order
Runge kutta 2nd Order

Dynamic programming
Dynamic programming

P, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-Hard

Gram-Schmidt and QR Decomposition (Factorization) of Matrices
Gram-Schmidt and QR Decomposition (Factorization) of Matrices

Ai 그까이거
Ai 그까이거

Machine Learning lecture2(linear regression)
Machine Learning lecture2(linear regression)

Yolo
Yolo

Differential Equation and its Application in LR circuit
Differential Equation and its Application in LR circuit

QR Algorithm Presentation
QR Algorithm Presentation

Fuzzy logic ppt
Fuzzy logic ppt

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)

Euler and improved euler method
Euler and improved euler method

Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...

3. Linear Algebra for Machine Learning: Factorization and Linear Transformations
3. Linear Algebra for Machine Learning: Factorization and Linear Transformations

Lu decomposition
Lu decomposition

[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)

Gamma beta functions-1
Gamma beta functions-1

Numerical analysis (Bisectional method) application
Numerical analysis (Bisectional method) application

Support Vector Machine and Implementation using Weka
Support Vector Machine and Implementation using Weka

## Similar to 한국어와 NLTK, Gensim의 만남

Compiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term RewritingEelco Visser

Natural Language Processing and Python
Natural Language Processing and Pythonanntp

CS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term RewritingEelco Visser

Python memory management_v2
Python memory management_v2Jeffrey Clark

Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataAnne Nicolas

Scala eXchange opening
Scala eXchange openingMartin Odersky

RNN, LSTM and Seq-2-Seq Models
RNN, LSTM and Seq-2-Seq ModelsEmory NLP

python chapter 1
python chapter 1Raghu nath

Python chapter 2
Python chapter 2Raghu nath

Gpu programming with java
Gpu programming with javaGary Sieling

Data Exploration and Visualization with R
Data Exploration and Visualization with RYanchang Zhao

Class 31: Deanonymizing
Class 31: DeanonymizingDavid Evans

Declare Your Language: Transformation by Strategic Term Rewriting
Declare Your Language: Transformation by Strategic Term RewritingEelco Visser

Super Advanced Python –act1Ke Wei Louis

Lies, Damned Lies, and Substrings
Lies, Damned Lies, and SubstringsHaseeb Qureshi

### Similar to 한국어와 NLTK, Gensim의 만남(20)

Compiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term Rewriting

Natural Language Processing and Python
Natural Language Processing and Python

CS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term Rewriting

Python memory management_v2
Python memory management_v2

Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data

Scala eXchange opening
Scala eXchange opening

RNN, LSTM and Seq-2-Seq Models
RNN, LSTM and Seq-2-Seq Models

Music as data
Music as data

Alastair Butler - 2015 - Round trips with meaning stopovers
Alastair Butler - 2015 - Round trips with meaning stopovers

python chapter 1
python chapter 1

Python chapter 2
Python chapter 2

Gpu programming with java
Gpu programming with java

Python
Python

Python lecture 05
Python lecture 05

Data Exploration and Visualization with R
Data Exploration and Visualization with R

Class 31: Deanonymizing
Class 31: Deanonymizing

Declare Your Language: Transformation by Strategic Term Rewriting
Declare Your Language: Transformation by Strategic Term Rewriting

Lies, Damned Lies, and Substrings
Lies, Damned Lies, and Substrings

## More from Eunjeong (Lucy) Park

파이썬과 커뮤니티와 한국어 오픈데이터
파이썬과 커뮤니티와 한국어 오픈데이터Eunjeong (Lucy) Park

The beginner’s guide to 웹 크롤링 (스크래핑)
The beginner’s guide to 웹 크롤링 (스크래핑)Eunjeong (Lucy) Park

자바, 미안하다! 파이썬 한국어 NLP
자바, 미안하다! 파이썬 한국어 NLPEunjeong (Lucy) Park

대한민국을 위한 Open Linked Political Data 플랫폼, "정치in" 제안
대한민국을 위한 Open Linked Political Data 플랫폼, "정치in" 제안Eunjeong (Lucy) Park

On Semi-Supervised Learning and Beyond
On Semi-Supervised Learning and BeyondEunjeong (Lucy) Park

Introduction to Data Mining for Newbies
Introduction to Data Mining for NewbiesEunjeong (Lucy) Park

### More from Eunjeong (Lucy) Park(6)

파이썬과 커뮤니티와 한국어 오픈데이터
파이썬과 커뮤니티와 한국어 오픈데이터

The beginner’s guide to 웹 크롤링 (스크래핑)
The beginner’s guide to 웹 크롤링 (스크래핑)

자바, 미안하다! 파이썬 한국어 NLP
자바, 미안하다! 파이썬 한국어 NLP

대한민국을 위한 Open Linked Political Data 플랫폼, "정치in" 제안
대한민국을 위한 Open Linked Political Data 플랫폼, "정치in" 제안

On Semi-Supervised Learning and Beyond
On Semi-Supervised Learning and Beyond

Introduction to Data Mining for Newbies
Introduction to Data Mining for Newbies

fundamentals of digital imaging - POONAM.pptx
fundamentals of digital imaging - POONAM.pptxPoonamRijal

IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdfAustraliaChapterIIBA

Introduction to data science.pdf-Definition,types and application of Data Sci...
Introduction to data science.pdf-Definition,types and application of Data Sci...DrSumathyV

Basics of Creating Graphs / Charts using Microsoft Excel
Basics of Creating Graphs / Charts using Microsoft ExcelTope Osanyintuyi

Tips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data GoalsDataArchiva

Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...Thibaud Le Douarin

ppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptxHizkiaJastis

What you need to know about Generative AI and Data Management?
What you need to know about Generative AI and Data Management?Denodo

itc limited word file.pdf...............
itc limited word file.pdf...............mahetamanav24

Artificial Intelligence for Vision: A walkthrough of recent breakthroughs
Artificial Intelligence for Vision: A walkthrough of recent breakthroughsNikolas Markou

A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)UNCResearchHub

Operations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample ScreensKondapi V Siva Rama Brahmam

Unlocking New Insights Into the World of European Soccer Through the European...
Unlocking New Insights Into the World of European Soccer Through the European...ThinkInnovation

fundamentals of digital imaging - POONAM.pptx
fundamentals of digital imaging - POONAM.pptx

IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf
IIBA Adl - Being Effective on Day 1 - Slide Deck.pdf

Introduction to data science.pdf-Definition,types and application of Data Sci...
Introduction to data science.pdf-Definition,types and application of Data Sci...

Basics of Creating Graphs / Charts using Microsoft Excel
Basics of Creating Graphs / Charts using Microsoft Excel

Tips to Align with Your Salesforce Data Goals
Tips to Align with Your Salesforce Data Goals

Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...
Generative AI Rennes Meetup with OVHcloud - WAICF highlights & how to deploy ...

ppt penjualan berbasis online omset.pptx
ppt penjualan berbasis online omset.pptx

What you need to know about Generative AI and Data Management?
What you need to know about Generative AI and Data Management?

itc limited word file.pdf...............
itc limited word file.pdf...............

Artificial Intelligence for Vision: A walkthrough of recent breakthroughs
Artificial Intelligence for Vision: A walkthrough of recent breakthroughs

A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)

Operations Data On Mobile - inSis Mobile App - Sample Screens
Operations Data On Mobile - inSis Mobile App - Sample Screens

Electricity Year 2023_updated_22022024.pptx
Electricity Year 2023_updated_22022024.pptx

Unlocking New Insights Into the World of European Soccer Through the European...
Unlocking New Insights Into the World of European Soccer Through the European...

### 한국어와 NLTK, Gensim의 만남

• 1. NLTK, Gensim ( : ) PyCon Korea 2015 2015-08-29 Lucy Park ( ) 1 / 69
• 2. KoNLPy , ? ! NLTK , Gensim , . bag-of-words, document embedding , KoNLPy, NLTK, Gensim . KoNLPy: http://konlpy.org NLTK: http://nltk.org Gensim: http://radimrehurek.com/gensim 2 / 69
• 3. (a.k.a. lucypark, echojuliett, e9t) . CAVEAT: KoNLPy Just another yak shaver... 11:49 <@sanxiyn> 또다시 yak shaving의 신비한 세계 11:51 <@sanxiyn> yak shaving이 뭔지 다 아시죠? 11:51 <디토군> 방금 찾아보고 왔음 11:51 <@mana> (조용히 설명을 기대중) 11:51 <@sanxiyn> 나무를 베려고 하는데 11:52 <@sanxiyn> 도끼질을 하다가 11:52 <@sanxiyn> 도끼가 더 잘 들면 나무를 쉽게 벨텐데 해서 11:52 <@sanxiyn> 도끼 날을 세우다가 11:52 <@sanxiyn> 도끼 가는 돌이 더 좋으면 도끼 날을 더 빨리 세울텐데 해서 11:52 <@sanxiyn> 좋은 숫돌이 있는 곳을 수소문해 보니 11:52 <@mana> … 11:52 <&홍민희> 그거 전형적인 제 행동이네요 11:52 <@sanxiyn> 저 멀이 어디에 세계 최고의 숫돌이 난다고 11:52 <@sanxiyn> 거기까지 야크를 타고 가려다가 11:52 <@mana> 항상하던 짓이라서 타이핑을 할 수 없었습니다 11:52 <@sanxiyn> 야크 털을 깎아서… 11:52 <@sanxiyn> etc. 3 / 69
• 4. / KoNLPy, NLTK, Gensim (?) ! Gensim , . , . . ? 3 4 / 69
• 7. . , . . -- Bengio et al., Deep Learning 1 , Book in preparation for MIT Press, 2015. 7 / 69
• 9. ? , . " " . 9 / 69
• 11. , , (binary) set one-hot vector 1-of-V = , = , = , . . . =w ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ 1 0 0 ⋮ 0 ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ w ( ) ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ 0 1 0 ⋮ 0 ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ w ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ 0 0 1 ⋮ 0 ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ w ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ 0 0 0 ⋮ 1 ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ wi i |V| 11 / 69
• 12. . ex: " " " ", " " " " 0 . ( = ( = 0w )⊤ w w )⊤ w * BOW (discrete) (local representation) . (distributed representation) 12 / 69
• 13. , bag of words(BOW) / ( "term existance" ) n (a.k.a. term frequency (TF)) ex: term frequency–inverse document frequency (TF-IDF) =d1 binary ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ 1 1 0 ⋮ 0 ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ =d1 tf ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ 3 1 0 ⋮ 0 ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 13 / 69
• 14. .* . . " ". * : Vincent Spruyt, The Curse of Dimensionality in classiﬁcation, 2014. 14 / 69
• 15. " ?" (WordNet) X is-a Y (hypernym) (DAG; directed acyclic graph) . ( : .) from nltk.corpus import wordnet as wn wn.synsets('computer') # "computer" #=> [Synset('computer.n.01'), Synset('calculator.n.01')] : ( ) / 15 / 69
• 16. : 1. , . from nltk.corpus import wordnet as wn wn.synsets('nginx') #=> [] 2. . wn.synsets('yolo') # : you only live once #=> [] 3. NLTK ... ( !) wn.synsets(' ', lang='kor') #=> WordNetError: Language is not supported. 16 / 69
• 17. .. . ? 17 / 69
• 19. . You shall know a word by the company it keeps. -- J.R. Firth (1957) 19 / 69
• 20. 1: ? __ . , __ . __ . : [ ] 20 / 69
• 21. 2: " " ? ? 21 / 69
• 24. , (context) . (context) := (window) / 1-8 , 3-17 * " " ex: "PyCon" "R" "Python" ? * Jurafsky, Martin, Speech and Language Processing 3rd Ed. 19.1 , Book in preparation, 2015. 24 / 69
• 26. Co-occurrence documents = [[" ", " ", " ", " "], [" ", "R", " ", " "]] 1. Term-document matrix ( ): Computation (!) x_td = [[1, 1], # [1, 0], # [0, 1], # R [1, 1], # [1, 1]] # 1. Term-term matrix ( ): syntactic, semantic ( ==5): x_tt = [[0, 1, 1, 2, 0], # [1, 0, 0, 1, 1], # [1, 0, 0, 1, 1], # R [2, 1, 1, 0, 2], # [0, 1, 1, 2, 0]] # ∈Xtd ℝ|V|×M M ∈Xtt ℝ|V|×|V| 26 / 69
• 27. Co-occurrence matrix . import math def dot_product(v1, v2): return sum(map(lambda x: x[0] * x[1], zip(v1, v2))) def cosine_measure(v1, v2): prod = dot_product(v1, v2) len1 = math.sqrt(dot_product(v1, v1)) len2 = math.sqrt(dot_product(v2, v2)) return prod / (len1 * len2) : skewed ( , ) discriminative (ex: list(' ')) ? 27 / 69
• 28. Neural embeddings : ! (language model) ex: [" ", " ", " ", " "] ? (ex: "!", "." "?") , . Neural network language model (NNLM)* * Bengio, Ducharme, Vincent, A Neural Probabilistic Language Model, {NIPS, 2000; JMLR, 2003}. 28 / 69
• 29. word2vec * T. Mikolov word2vec neural embedding ! (n -> m ) * Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013. 29 / 69
• 30. : word2vec * * , , , (1946-2015) , 49 2 , 2015. 30 / 69
• 32. doc2vec * : ( ) ! . . |V| * Le, Mikolov, Distributed Representations of Sentences and Documents, ICML, 2014. ( doc2vec paragraph vector ) 32 / 69
• 33. : Term frequency -> . . documents = [ [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '] ] words = set(sum(documents, [])) print(words) # => {' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '} def term_frequency(document): # do something return vector vectors = [term_frequency(doc) for doc in documents] print(vectors) # => { # 'S1': [1, 1, 0, 1, 0, 0, 1, 1], # 'S2': [1, 1, 0, 1, 0, 0, 1, 1], # 'S3': [1, 0, 0, 1, 1, 1, 1, 0], # 'S4': [1, 1, 0, 1, 1, 1, 0, 0], # 'S5': [1, 0, 1, 1, 1, 0, 1, 0], # 'S6': [1, 1, 1, 1, 1, 0, 0, 0] # } |V| 33 / 69
• 34. : doc2vec -> . . documents = [ [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '], [' ', ' ', ' ', ' ', ' '] ] words = set(sum(documents, [])) print(words) # => {' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '} def doc2vec(document): # do something else ( Gensim !) return vector vectors = [doc2vec(doc) for doc in documents] print(vectors) # => { # 'S1': [-0.01599941, 0.02135301, 0.0927715], # 'S2': [0.02333261, 0.00211833, 0.00254255], # 'S3': [-0.00117272, -0.02043504, 0.00593186], # 'S4': [0.0089237, -0.00397806, -0.13199195], # 'S5': [0.02555059, 0.01561624, 0.03603476], # 'S6': [-0.02114407, -0.01552016, 0.01289922] # } |V| 34 / 69
• 35. . : Sparse, long vectors Dense, short vectors 1-hot-vectors word2vec bag-of-words (term existance, TF, TF-IDF ) doc2vec . 35 / 69
• 36. Sentiment analysis for Korean movie reviews 36 / 69
• 37. , ? 37 / 69
• 39. Naver sentiment movie corpus v1.0 Maas et al., 2011 : 100 140 ( ' ') 20 ( 64 ) ratings_train.txt: 15 , ratings_test.txt: 5 / (i.e., random guess yields 50% accuracy) : 19MB URL: http://github.com/e9t/nsmc/ 39 / 69
• 41. \$ cat ratings_train.txt | head -n 25 id document label 9976970 .. 0 3819312 ... .... 1 10265843 0 9045019 .. .. 0 6483659 ! 5403919 3 1 8 . ... . 0 7797314 . 0 9443947 .. 7156791 1 5912145 ? .. ? 9008700 . ♥ 1 10217543 90 !! ~ 5957425 0 8628627 . . . 9864035 6852435 1 9143163 . 4891476 0 7465483 !!♥ 3989148 , . . 1 4581211 . 1 2718894 1 9705777 . .... 471131 . 1 41 / 69
• 42. , ? 1. KoNLPy 2. NLTK 3. Bag-of-words(term existence) classify 4. Gensim doc2vec classify 42 / 69
• 43. 1. Data preprocessing (feat. KoNLPy) def read_data(ﬁlename): with open(ﬁlename, 'r') as f: data = [line.split('t') for line in f.read().splitlines()] data = data[1:] # header return data train_data = read_data('ratings_train.txt') test_data = read_data('ratings_test.txt') NOTE: pandas.read_csv() quoting parameter csv.QUOTE_NONE (3) (ex: ) 43 / 69
• 44. 1. Data preprocessing (feat. KoNLPy) # row, column print(len(train_data)) # nrows: 150000 print(len(train_data[0])) # ncols: 3 print(len(test_data)) # nrows: 50000 print(len(test_data[0])) # ncols: 3 44 / 69
• 45. 1. Data preprocessing (feat. KoNLPy) from konlpy.tag import Twitter pos_tagger = Twitter() def tokenize(doc): # norm, stem optional return ['/'.join(t) for t in pos_tagger.pos(doc, norm=True, stem=True)] train_docs = [(tokenize(row[1]), row[2]) for row in train_data] test_docs = [(tokenize(row[1]), row[2]) for row in test_data] # from pprint import pprint pprint(train_docs[0]) # => [([' /Exclamation', # ' /Noun', # '../Punctuation', # ' /Noun', # ' /Noun', # ' /Verb', # ' /Noun'], # '0')] NOTE: . 45 / 69
• 46. 1. Data preprocessing (feat. KoNLPy) " ?" , x * " (POS) ?" ex: ' /Josa' vs ' /Noun' * https://github.com/karpathy/char-rnn 46 / 69
• 47. 2. Data exploration (feat. NLTK) ! Training data token tokens = [t for d in train_docs for t in d[0]] print(len(tokens)) # => 2194536 47 / 69
• 48. 2. Data exploration (feat. NLTK) nltk.Text() ! (Exploration ) document , . import nltk text = nltk.Text(tokens, name='NMSC') print(text) # => <Text: NMSC> 48 / 69
• 49. 2. Data exploration (feat. NLTK) Count tokens print(len(text.tokens)) # returns number of tokens # => 2194536 print(len(set(text.tokens))) # returns number of unique tokens # => 48765 pprint(text.vocab().most_common(10)) # returns frequency distribution # => [('./Punctuation', 68630), # (' /Noun', 51365), # (' /Verb', 50281), # (' /Josa', 39123), # (' /Verb', 34764), # (' /Josa', 30480), # ('../Punctuation', 29055), # (' /Josa', 27108), # (' /Josa', 26696), # (' /Josa', 23481)] 49 / 69
• 50. 2. Data exploration (feat. NLTK) Plot tokens by frequency text.plot(50) # Plot sorted frequency of top 50 tokens ;; 50 / 69
• 51. 2. Data exploration (feat. NLTK) ! Troubleshooting: For those who see rectangles instead of letters in the saved plot ﬁle, include the following conﬁgurations before drawing the plot: Some example fonts: Mac OS: /Library/Fonts/AppleGothic.ttf from matplotlib import font_manager, rc font_fname = 'c:/windows/fonts/gulim.ttc' # A font of your choice font_name = font_manager.FontProperties(fname=font_fname).get_name() rc('font', family=font_name) 51 / 69
• 52. 2. Data exploration (feat. NLTK) . . . 52 / 69
• 53. 2. Data exploration (feat. NLTK) Collocations ( ): (ex: "text" + "mining") text.collocations() # # => /Determiner /Noun; /Sufﬁx /Josa; /Determiner /Noun; /Noun # /Verb; /Noun /Josa; 10/Number /Noun; /Noun /Sufﬁx; /Noun # /Adjective; /Noun /Josa; /Noun /Josa; /Noun /Josa; /Sufﬁx # /Josa; /Noun /Noun; /Noun /Josa; /Sufﬁx /Josa; /Noun # /Josa; /Noun /Josa; /Sufﬁx /Josa; /Noun /Sufﬁx; /Noun # /Josa NOTE: nltk.Text() , source code API . . 53 / 69
• 54. 3. Sentiment classiﬁcation with term-existance* term . # 2000 # WARNING: time/memory efﬁcient selected_words = [f[0] for f in text.vocab().most_common(2000)] def term_exists(doc): return {'exists({})'.format(word): (word in set(doc)) for word in selected_words} # training corpus train_docs = train_docs[:10000] train_xy = [(term_exists(d), c) for d, c in train_docs] test_xy = [(term_exists(d), c) for d, c in test_docs] Try this: . 50 ? TF, TF-IDF ? (sparse matrix) or scikit-learn TﬁdfVectorizer() ! * Natural Language Processing with Python Chapter 6. Learning to classify text 54 / 69
• 55. 3. Sentiment classiﬁcation with term-existance . ! . NLTK : Naive Bayes Classiﬁers Decision Tree Classiﬁers Maximum Entropy Classiﬁers . scikit-learn . NLTK Naive Bayes Classiﬁer ! 55 / 69
• 56. 3. Sentiment classiﬁcation with term-existance Naive Bayes classiﬁer # classiﬁer = nltk.NaiveBayesClassiﬁer.train(train_xy) print(nltk.classify.accuracy(classiﬁer, test_xy)) # => 0.80418 classiﬁer.show_most_informative_features(10) # => Most Informative Features # exists( /Noun) = True 1 : 0 = 38.0 : 1.0 # exists( /Noun) = True 0 : 1 = 30.1 : 1.0 # exists(♥/Foreign) = True 1 : 0 = 24.5 : 1.0 # exists( /Noun) = True 0 : 1 = 22.1 : 1.0 # exists( /Noun) = True 0 : 1 = 19.5 : 1.0 # exists( /Noun) = True 0 : 1 = 19.4 : 1.0 # exists( /Noun) = True 1 : 0 = 18.9 : 1.0 # exists( /Noun) = True 0 : 1 = 16.9 : 1.0 # exists( /Noun) = True 1 : 0 = 16.9 : 1.0 # exists( /Noun) = True 1 : 0 = 15.9 : 1.0 / baseline accuracy 0.50 1 accuracy 0.80 . ? 56 / 69
• 57. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) doc2vec / . Gensim . * : Radim Rehureck * Gensim 0.12.0 . ! 57 / 69
• 58. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) !!! Gensim ... 58 / 69
• 59. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) doc2vec . doc2vec from gensim.models.doc2vec import TaggedDocument OK from collections import namedtuple TaggedDocument = namedtuple('TaggedDocument', 'words tags') # 15 training documents tagged_train_docs = [TaggedDocument(d, [c]) for d, c in train_docs] tagged_test_docs = [TaggedDocument(d, [c]) for d, c in test_docs] 59 / 69
• 60. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) from gensim.models import doc2vec # doc_vectorizer = doc2vec.Doc2Vec(size=300, alpha=0.025, min_alpha=0.025, seed=1234) doc_vectorizer.build_vocab(tagged_train_docs) # Train document vectors! for epoch in range(10): doc_vectorizer.train(tagged_train_docs) doc_vectorizer.alpha -= 0.002 # decrease the learning rate doc_vectorizer.min_alpha = doc_vectorizer.alpha # ﬁx the learning rate, no decay # To save # doc_vectorizer.save('doc2vec.model') Try this: Vector size, epoch 60 / 69
• 61. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) ! (1/3) pprint(doc_vectorizer.most_similar(' /Noun')) # => [(' /Noun', 0.5669919848442078), # (' /Noun', 0.5522832274436951), # (' /Noun', 0.5021427869796753), # (' /Noun', 0.5000861287117004), # (' /Noun', 0.4368450343608856), # (' /Noun', 0.42848479747772217), # (' /Noun', 0.42714330554008484), # (' /Noun', 0.41590073704719543), # (' /Noun', 0.41056352853775024), # (' /Noun', 0.4052993059158325)] ... . 61 / 69
• 62. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) ! (2/3) pprint(doc_vectorizer.most_similar(' /KoreanParticle')) # => [(' /KoreanParticle', 0.5768033862113953), # (' /KoreanParticle', 0.4822016954421997), # ('!!!!/Punctuation', 0.4395076632499695), # ('!!!/Punctuation', 0.4077949523925781), # ('!!/Punctuation', 0.4026390314102173), # ('~/Punctuation', 0.40038347244262695), # ('~~/Punctuation', 0.39946430921554565), # ('!/Punctuation', 0.3899948000907898), # ('^^/Punctuation', 0.3852730989456177), # ('~^^/Punctuation', 0.3797937035560608)] ! . 62 / 69
• 63. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) , v(KING) – v(MAN) + v(WOMAN) = v(QUEEN) ? 63 / 69
• 64. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) ! (3/3) pprint(doc_vectorizer.most_similar(positive=[' /Noun', ' /Noun'], negative=[' /Noun'])) # => [(' /Noun', 0.32974398136138916), # (' /Noun', 0.305545836687088), # (' /Noun', 0.2899821400642395), # (' /Noun', 0.2856029272079468), # (' /Noun', 0.2840743064880371), # (' /Noun', 0.28247544169425964), # (' /Noun', 0.2795347571372986), # (' /Noun', 0.2794691324234009), # (' /Noun', 0.27567577362060547), # (' /Noun', 0.2734225392341614)] ...? ? 64 / 69
• 65. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) , ;; , . ? * text.concordance(' /Noun', lines=10) # => Displaying 10 of 145 matches: # Josa /Noun /Josa ,,/Punctuation /Noun /Noun ...../Punctuation /Noun # /Noun /Noun ../Punctuation /Noun /Noun /Noun /Noun /Noun /Josa /Nou # nction /Noun /Josa /Adjective /Noun /Verb /Verb /Noun /Josa # /Noun /Noun /Noun ?/Punctuation /Noun /Noun ./Punctuation /Noun /No # b /Noun /Noun /KoreanParticle /Noun /Noun /Adjective ./Punctuation # osa /Noun /Josa /Noun /Josa /Noun /Josa /Noun /Noun /Josa /Nou # /Noun /Josa /Noun ./Punctuation /Noun ,/Punctuation /Noun /Noun /Suf # Josa /Noun /Noun /Noun /Noun /Noun /Josa /Noun /Sufﬁx /Josa / # /Verb ./Punctuation /Adjective /Noun /Noun .../Punctuation /Noun # tive /Noun /Noun .../Punctuation /Noun /Noun ../Punctuation /Noun /J * Huang, Socher, Improving Word Representations via Global Context and Multiple Word Prototypes, ACL, 2012. 65 / 69
• 66. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) . . train_x = [doc_vectorizer.infer_vector(doc.words) for doc in tagged_train_docs] train_y = [doc.tags[0] for doc in tagged_train_docs] len(train_x) # term existance # => 150000 len(train_x[0]) # => 300 test_x = [doc_vectorizer.infer_vector(doc.words) for doc in tagged_test_docs] test_y = [doc.tags[0] for doc in tagged_test_docs] len(test_x) # => 50000 len(test_x[0]) # => 300 66 / 69
• 67. 4. Sentiment classiﬁcation with doc2vec (feat. Gensim) scikit-learn LogisticRegression() . from sklearn.linear_model import LogisticRegression classiﬁer = LogisticRegression(random_state=1234) classiﬁer.ﬁt(train_x, train_y) classiﬁer.score(test_x, test_y) # => 0.78246000000000004 ! Accuracy 0.78 . Term existance accuracy 2% , . , term existance , doc2vec ! 67 / 69
• 68. . KoNLPy , NLTK , Gensim . Term existance document vector . task apply ! / * / ! * , , , : , SNUDM Technical Reports, Apr 14, 2015. 68 / 69
Current LanguageEnglish
Español
Portugues
Français
Deutsche