18. Preprocessing -
• Sejong Corpus
• Built by the National Institute of Korean Language, 1998-2007.
Source: https://ithub.korean.go.kr/user/guide/corpus/guide1.do
25. Word Embedding - Word2Vec
• Words are represented as vectors.
• This is called word embedding, or word representation.
• word2vec
• You shall know a word by the company it keeps (Firth, J. R. 1957:11)
27. Word Embedding - Fasttext
• word2vec
• word2vec: $s(w, c) = u_w^\top v_c$
• fasttext: $s(w, c) = \sum_{g \in G_w} z_g^\top v_c$
• where $G_w$ is the set of n-grams appearing in w
• subword
<Example>
w: Alpaca
n-grams of w (n=3): <Al, Alp, lpa, pac, aca, ca> (here < and > are the word-boundary symbols)
Source: Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
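The n-gram example above can be reproduced with a few lines of Python. This is a minimal sketch, not fastText's own implementation: the function name `char_ngrams` is hypothetical, and it assumes the word is wrapped in the boundary symbols < and > before slicing, as in Bojanowski et al. (2016).

```python
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the character n-grams of `word`, using < and > as boundary symbols."""
    padded = f"<{word}>"  # mark the start and end of the word
    # Slide a window of width n over the padded word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("Alpaca", 3))
```

For "Alpaca" with n=3 this yields the six n-grams listed on the slide. The paper additionally uses a range of n values and keeps the whole word as its own unit, which this sketch leaves out.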
36. Sentence Similarity - Term vector
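• Comparing sentences as vectors requires an embedding.
• The simplest embedding is the term vector.
• It is similar to one-hot encoding.
• With term vectors, similarity can be measured by cosine similarity, edit distance, and so on.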
<Example>
- I love you, you love me
- {“I”: 1, “love”: 2, “you”: 2, “me”: 1}
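A term vector like the one above is just a count of each term in the sentence. The following is a minimal sketch; the helper name `term_vector` and the regex-based tokenizer are assumptions made for illustration (real Korean text would need a morphological analyzer instead).

```python
import re
from collections import Counter


def term_vector(sentence: str) -> Counter:
    """Count how often each term occurs in the sentence."""
    tokens = re.findall(r"\w+", sentence)  # naive tokenizer: runs of word characters
    return Counter(tokens)


print(dict(term_vector("I love you, you love me")))
```

The counts match the slide's example: "love" and "you" appear twice, "I" and "me" once.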
38. Sentence Similarity - Term vector
• term vector
• Which is more similar, pair1 or pair2?
<Example>
pair1: I love you <-> I like you
pair2: I love you <-> I hate you
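To see what term vectors say about these two pairs, the cosine similarity can be computed directly. This is a minimal sketch assuming whitespace tokenization and raw counts; the names `tv` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tv(sentence: str) -> Counter:
    """Term vector: raw counts of whitespace-separated tokens."""
    return Counter(sentence.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


pair1 = cosine(tv("I love you"), tv("I like you"))
pair2 = cosine(tv("I love you"), tv("I hate you"))
print(round(pair1, 3), round(pair2, 3))
```

Both pairs share two of three terms, so both score 2/3: the term vector cannot tell "like" from "hate".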
39. Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis
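• Represent each word as a vector (= word vector)
• Compute the cosine similarity between every pair of words
• Average the per-column maxima to obtain the ESA similarity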
Sentences: "I love you" vs. "I like you"

similarity   I     love   you
I            1     0.2    0.5
like         0.3   0.9    0.4
you          0.5   0.4    1
max          1     0.9    1
40. Sentence Similarity - ESA Similarity
Sentences: "I love you" vs. "I hate you"

similarity   I     love   you
I            1     0.2    0.5
hate         0.3   0.5    0.4
you          0.5   0.4    1
max          1     0.5    1
41. Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis
• Similarity of each sentence to "I love you":

             I like you   I hate you
cosine       0.667        0.667
ESA          0.967        0.833
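The ESA figures are consistent with taking, for each word of "I love you", the best-matching word in the other sentence and averaging those maxima. The sketch below assumes that reading of the slides; the function name `max_match_similarity` is hypothetical, and the word-to-word similarities are copied from the slide tables rather than computed from real word vectors.

```python
from typing import List


def max_match_similarity(sim: List[List[float]]) -> float:
    """Average, over columns, of the highest similarity found in each column.

    sim[i][j] is the similarity between word i of one sentence (rows)
    and word j of the other sentence (columns).
    """
    n_cols = len(sim[0])
    col_max = [max(row[j] for row in sim) for j in range(n_cols)]
    return sum(col_max) / n_cols


# Columns: I, love, you. Rows: words of the compared sentence.
like_table = [[1, 0.2, 0.5],   # I
              [0.3, 0.9, 0.4],  # like
              [0.5, 0.4, 1]]    # you
hate_table = [[1, 0.2, 0.5],   # I
              [0.3, 0.5, 0.4],  # hate
              [0.5, 0.4, 1]]    # you

print(round(max_match_similarity(like_table), 3))
print(round(max_match_similarity(hate_table), 3))
```

With these inputs the two scores round to 0.967 and 0.833, so unlike the cosine over term vectors, this measure separates "I like you" from "I hate you".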
42. Sentence Similarity - ESA Similarity
• Song, Y., & Roth, D. (2015). Unsupervised sparse vector densification for short
text similarity. In Proceedings of the 2015 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies (pp. 1275-1280).
44.
• preprocessing 80%
• Zipf’s law
• corpus ,
• , count based
• unlabeled data label
• label insight