<Little Big Data #1> Machine Learning with Korean Chat Data (revised so the Korean text displays)

<Little Big Data #1>
summatic@scatterlab.co.kr
1

•  
(@ , 2016. 1~)

•  
(2016. 8~)

•  
(2018. 5~)
!2

:
:
:•

• (?)

• B

•
.

•
.

•
.

•
.

•

• ( , id )
.

• ( , , )
.
!3

• Intro

•

•

•

•

•

• Preprocessing

• Word Embedding

• Document Similarity

•
!4

•

•

• “ ” -> “ " -> “ ” .

• “ ” .

• .

• .

•

• .

• .
!6

-
• Hell

• .

•

•

•

•

•
< >
- ?
- ? / ? ?
< > , ,
< > , ,
< >
< > , , ,
!7

•

•

•
-
< >
/ / ? / / ? / ? /  
< >
- (X) -> (O)
- ? (X) -> ? (O)
- ? (X) -> ? (O)
- (X) -> (O)
< >
-
-
!8

- preprocess
• Data Science

• Garbage in, Garbage out

• , preprocess
.

• preprocess ?
!10

Preprocessing -
•

• preprocess (POS[1] tagger)
.

• :

• KoNLPy[2]

•

• , ,
[1] POS: Part of speech

[2] http://konlpy-ko.readthedocs.io/ko/v0.4.3/
!12

Preprocessing - ( )
• . ?
• _NP _MAG _VV _ECE  
_VXA _EFN ._SF _MAG  
_VV _EFQ ?_SF
•
• _NP _MAG _NNG  
_XSV _ECE
• .
• _NNG _VA _ECD _VV  
_EFN ._SF _MAG _VV  
_ECE _NNG _XSV _ECS
< > < >
!13

Preprocessing - ( )
• . ?
• _UN _JKS _MAG _MAG  
_VV _ECE _NNG _MAG  
_MAG _VV _ECS ?_SF
•
• _NP _NNG _NNG  
_JKM _VV
• .
• _NNG _VA _ECD _NP  
_UN ._SF _MAG _VV _ECE
_MAG _VV _ECS _EMO
< > < >
!15

Preprocessing -
•

• ( , corpus)

• (corpus)

•
!17
Source: https://ko.wikipedia.org/wiki/

Preprocessing -
• Sejong Corpus

• National Institute of the Korean Language, 1998-2007.

•

• (..)
!18
Source: https://ithub.korean.go.kr/user/guide/corpus/guide1.do

• preprocess

• normalize( )

• preprocessing

•

• tokenizing
< >
count(“ ”) < count(“ ?”) , “ ” .
Preprocessing -
!19

Preprocessing - Tokenizing
• Tokenizing:

• token , .

• , token

• “ ” “ ” tokenizing
.
!20
< >
before tokenizing:
.
after tokenizing:
/ / / / / / / / / / / / / / / /
/ / / / .

•

•

• Probability that cn follows c1c2..cn-1 to form c1..cn

•
Preprocessing - Tokenizing(Cohesion Probability)
!21
< >
“ ” “ ” .
Source: https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/05/05/cohesion/

Preprocessing - Tokenizing(Cohesion Probability)
• )

• = +
!22
substring count
- count( ) = 20000
- count( ) = 1500
- count( ) = 1200
- count( ) = 30
- count( ) = 15
cohesion probability
- CP( ) = 0.2738
- CP( ) = 0.3914
- CP( ) = 0.1968
- CP( ) = 0.2371
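The counts and CP values listed above can be reproduced with a short sketch. It assumes the five counts belong to prefixes of length 1 to 5 of one word, and that the score is (count(c1..cn) / count(c1)) ** (1/n). That exponent is inferred from the slide's numbers, not stated in the source; other write-ups of cohesion probability may use a different root. The function and variable names are illustrative.

```python
def cohesion(prefix_count: int, first_char_count: int, length: int) -> float:
    """Score a prefix of the given length (assumed form, see the note above)."""
    return (prefix_count / first_char_count) ** (1.0 / length)


# Counts from the slide, keyed by prefix length (1 = first character only).
counts = {1: 20000, 2: 1500, 3: 1200, 4: 30, 5: 15}

# Score every prefix of length 2..5 against the count of the first character.
scores = {n: cohesion(counts[n], counts[1], n) for n in range(2, 6)}

for n, score in scores.items():
    print(f"prefix length {n}: CP = {score:.4f}")

# The prefix with the highest score is the most word-like candidate.
best_length = max(scores, key=scores.get)
print("best prefix length:", best_length)
```

The printed scores agree with the slide's values up to rounding in the last digit, and the length-3 prefix scores highest.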

Preprocessing - Tokenizing
• Cohesion probability .

• .

• [ 2017] NLP -

•

• https://www.slideshare.net/kimhyunjoonglovit/pycon2017-koreannlp

•

• https://github.com/lovit/soynlp
!23

Word Embedding - Word2Vec
• vector .

• word embedding word representation .

• word2vec

• You shall know a word by the company it keeps (Firth, J. R. 1957:11)
!25

Word Embedding - Word2Vec
• word2vec OOV
.

• OOV(Out-of-vocabulary): (=dictionary ) vocabulary
vector

• training input vocabulary OOV
, inference .

• inference :

•

• ( , )
, dictionary .
!26

• word2vec

• word2vec:

•

• fasttext:

• where G_w is the set of n-grams appearing in w

• subword
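The two scoring functions being compared did not survive extraction. As given in the cited paper (Bojanowski et al., 2016), with u_w the vector of word w, v_c the vector of context word c, and z_g the vector of n-gram g, they can be written as:

```latex
% word2vec (skip-gram): one vector per word
s(w, c) = u_w^{\top} v_c

% fastText: sum over the vectors of the n-grams of w
s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c
```

Because the fastText score depends only on n-gram vectors, a word missing from the training vocabulary can still be given a vector as long as some of its n-grams were seen.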
Word Embedding - Fasttext
!27
< >
w: Alpaca
n grams of w (n=3) = <Al, Alp, lpa, pac, aca, ca>
Source: Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
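The Alpaca example can be reproduced with a minimal sketch: wrap the word in the boundary markers `<` and `>`, then slide a window of n characters over it. The function name `char_ngrams` is illustrative and is not part of the fastText API.

```python
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the character n-grams of a word wrapped in '<' and '>'."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("Alpaca"))
# ['<Al', 'Alp', 'lpa', 'pac', 'aca', 'ca>']
```

The boundary markers are why `<Al` and `ca>` appear in the list: they let the model tell a prefix or suffix apart from the same characters in the middle of a word.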

Word Embedding - Fasttext +
• fasttext .

• (character) subword

• subword

• , OOV .
!28
< >
subwords( ) = < , , , , >
< >
= _ _ _
subwords( ) = < , _, _ , …, >
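Decomposing Hangul syllables into jamo before taking subwords can be sketched with plain Unicode arithmetic. This is an illustration, not the code from the hangul_jamo_fasttext repository: the function name `to_jamo`, the sample words, and the use of `_` as a filler for a missing final consonant are assumptions.

```python
# Compatibility jamo in Unicode composition order.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
        "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
        "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, first is "none"

HANGUL_BASE = 0xAC00      # code point of the first precomposed syllable
HANGUL_COUNT = 19 * 21 * 28


def to_jamo(text: str, filler: str = "_") -> str:
    """Split each Hangul syllable into initial, vowel, and final jamo."""
    out = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < HANGUL_COUNT:
            cho, rest = divmod(offset, 21 * 28)
            jung, jong = divmod(rest, 28)
            # Use the filler when the syllable has no final consonant.
            out.append(CHO[cho] + JUNG[jung] + (JONG[jong] or filler))
        else:
            out.append(ch)  # leave non-Hangul characters untouched
    return "".join(out)


print(to_jamo("한글"))
print(to_jamo("가"))
```

The filler keeps every syllable at exactly three symbols, so n-grams taken over the jamo sequence stay aligned with syllable boundaries.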

•
!29
- , 0.8590
- , 0.8465
- , 0.8180
- , 0.8055
- , 0.8018
- , 0.8017
- , 0.8007
- , 0.7983
- , 0.7972
- , 0.7948
- , 0.9022
- , 0.8986
- , 0.8887
- , 0.8866
- , 0.8567
- , 0.8498
- , 0.8474
- , 0.8413
- , 0.8335
- , 0.8191

•

•

• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word
vectors with subword information. arXiv preprint arXiv:1607.04606.

•

• https://github.com/facebookresearch/fastText

• https://radimrehurek.com/gensim/models/fasttext.html

• https://github.com/summatic/hangul_jamo_fasttext
!30

Sentence Similarity
• document
.

• document short sentence .

• word embedding vector embedding
cosine similarity .
!32
< >
sim( , ?)

Sentence Similarity - BOW + Word Embedding
• word vector

• doc2vec

• word embedding

• word embedding ?

• word embedding

• !=
!34
- similarity( , ) = 0.9011
- similarity( , ) = 0.8839
- similarity( , ) = 0.9707
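One common way to combine a bag of words with word embeddings is to average the word vectors of each sentence and compare the averages by cosine similarity. The sketch below assumes that reading of the slide; the two-dimensional vectors are made up for illustration and do not come from any trained model.

```python
import math


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "i": [1.0, 0.0],
    "love": [0.8, 0.6],
    "hate": [0.6, 0.8],
    "you": [0.0, 1.0],
}

sim = cosine(
    mean_vector("i love you".split(), toy_vectors),
    mean_vector("i hate you".split(), toy_vectors),
)
print(f"{sim:.4f}")
```

With these toy vectors the two sentences come out as nearly identical, because the averaged vectors are dominated by the words the sentences share.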

Sentence Similarity - RNN
• sentence embedding with RNN variants (LSTM, Bi-RNN, GRU, etc.)

• RNN language modeling

• “ .” <-> “ ”

• sequence embedding .

• .. “ ” “ ” embedding .

• “?”
!35

Sentence Similarity - Term vector
• vector embedding
embedding .

• embedding term vector

• one hot encoding .

• term vector cosine similarity, edit distance
.
!36
< >
- I love you, you love me
- {“I”: 1, “love”: 2, “you”: 2, “me”: 1}
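The term vector in the example above is a count of each token in the sentence. A minimal sketch with `collections.Counter`, assuming whitespace tokenization with commas dropped (the function name `term_vector` is illustrative):

```python
from collections import Counter


def term_vector(sentence: str) -> Counter:
    """Count how often each token occurs in the sentence."""
    tokens = sentence.replace(",", " ").split()
    return Counter(tokens)


tv = term_vector("I love you, you love me")
print(dict(tv))
```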

Sentence Similarity - Term vector
• term vector

• .

•

• pair1 pair2 ?
!38
< >
pair1: I love you <-> I like you
pair2: I love you <-> I hate you
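Cosine similarity over term vectors cannot tell these two pairs apart: each pair shares two of three tokens, so both score 2/3 (about 0.667). A short sketch, assuming whitespace tokenization:

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


base = Counter("I love you".split())
pair1 = cosine(base, Counter("I like you".split()))
pair2 = cosine(base, Counter("I hate you".split()))
print(f"pair1 = {pair1:.3f}, pair2 = {pair2:.3f}")
```

The score depends only on which tokens overlap, so "like" and "hate" are treated as equally different from "love".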

Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• (=word vector)

• cosine similarity

• ESA similarity
!39
I love you
I like you
similarity   I     love   you
I            1     0.2    0.5
like         0.3   0.9    0.4
you          0.5   0.4    1
max          1     0.9    1


• (=word vector)

• cosine similarity

• ESA similarity
!40
I love you
I hate you
similarity   I     love   you
I            1     0.2    0.5
hate         0.3   0.5    0.4
you          0.5   0.4    1
max          1     0.5    1


• I love you

• .
!41
          I like you   I hate you
cosine    0.667        0.667
ESA       0.967        0.833
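The ESA scores can be recomputed from the two word-similarity tables. The sketch below follows one reading of those tables: for each word of "I love you" (one column), take the highest similarity to any word of the other sentence, then average those maxima. The word-to-word similarities are the slide's example values; the function name is illustrative.

```python
def esa_similarity(sim_matrix):
    """Average of the per-column maxima of a word-similarity matrix."""
    col_max = [max(col) for col in zip(*sim_matrix)]
    return sum(col_max) / len(col_max)


# Rows: words of the second sentence. Columns: I, love, you.
like_matrix = [
    [1.0, 0.2, 0.5],  # I
    [0.3, 0.9, 0.4],  # like
    [0.5, 0.4, 1.0],  # you
]
hate_matrix = [
    [1.0, 0.2, 0.5],  # I
    [0.3, 0.5, 0.4],  # hate
    [0.5, 0.4, 1.0],  # you
]

print(round(esa_similarity(like_matrix), 3))  # 0.967
print(round(esa_similarity(hate_matrix), 3))  # 0.833
```

Unlike cosine similarity over term vectors, this score separates the two pairs, because "like" is closer to "love" than "hate" is in the word-similarity table.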

• .

•

• Song, Y., & Roth, D. (2015). Unsupervised sparse vector densification for short
text similarity. In Proceedings of the 2015 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies (pp. 1275-1280).

•

• ( )
!42

• preprocessing 80%

• Zipf’s law

• corpus ,

• ( ) .

•

•

• , count based

• unlabeled data label

• label insight
!44
