Korean Text mining
1. Kyunghoon Kim
UNIST
Department of Mathematical Sciences
December 12, 2017
kyunghoon@unist.ac.kr
A Mathematical Measurement for Korean Text mining
and its applications
2. Difficulty of Korean Language
Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 2 / 83
• New Concepts
- Korean Alphabet ( , , , , …)
- End of a word ( ) ( , , , …)
- Postposition ( ) ( , , , , …)
- Word order ( ) (SOV, …)
- …
Language Destruction
3. Outline
Korean Text Mining
1. Text Summarization (2013): Fuzzy System; Korean Language Feature V2
2. Text Clustering (2015): Term-Frequency Matrix ( LSI, NMF ); Syllable Vector
3. Learning of Text Relationship (2017): Artificial Neural Network ( Word2Vec ); Heterogeneous Word2Vec ( Law2Vec )
4. 1. Text Summarization | Motivation
< Raw News Article > < Summarized News >
March, 2013
How about Korean?
News Article -> Summarized Sentences
5. 1. Text Summarization | Process
Document Preprocessing -> Feature Selection -> Scoring by Model -> Refinement & Sorting by score
[Figure: morpheme-level part-of-speech analysis of a Korean sentence, one morpheme per line, with the tags NNP, JKB, NNG, JC, XSN, JKS, MAG, VV, EC, VX, EF, SF]
• Content word (Keyword) feature
• Title word feature
• Sentence location feature
• Sentence Length feature
• Proper Noun feature
• Upper-case word feature
• Cue-Phrase feature
• Biased Word feature
• Font based feature
• Pronouns
• Sentence-to-Sentence Cohesion
• Sentence-to-Centroid Cohesion
• Occurrence of non-essential information
• Discourse analysis
Only for English features
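A minimal sketch of how three of the listed features (title word, sentence location, sentence length) could be turned into numbers and combined into a sentence score. The function names, the scaling to [0, 1], and the plain average used for scoring are assumptions made for illustration; the slides score sentences with a fuzzy system, which is not reproduced here.

```python
import re

def sentence_features(title, sentences):
    """Compute three toy features per sentence, each scaled to [0, 1]."""
    title_words = set(re.findall(r"\w+", title.lower()))
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    max_len = max(len(words) for words in tokenized)
    n = len(sentences)
    rows = []
    for i, words in enumerate(tokenized):
        # Title word feature: share of title words that appear in the sentence.
        title_f = len(title_words & set(words)) / len(title_words) if title_words else 0.0
        # Sentence location feature: earlier sentences score higher.
        location_f = (n - i) / n
        # Sentence length feature: length relative to the longest sentence.
        length_f = len(words) / max_len if max_len else 0.0
        rows.append({"title": title_f, "location": location_f, "length": length_f})
    return rows

def rank_sentences(title, sentences):
    """Score each sentence by the mean of its features (an assumed scoring rule)."""
    feats = sentence_features(title, sentences)
    scores = [sum(f.values()) / len(f) for f in feats]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return order, scores

title = "Korean text summarization"
sentences = [
    "Korean text summarization is hard.",
    "It rains.",
    "Summarization helps readers of Korean news articles.",
]
order, scores = rank_sentences(title, sentences)
```

Taking the top entries of `order` gives the sentences to keep in the summary. Korean-specific features would replace or extend the English-oriented ones used here.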
6. 1. Text Summarization | Feature measurements
Feature based on English
Feature based on Korean
, , , , , ... , , , , ...
7. 1. Text Summarization | Fuzzy Set
8. 1. Text Summarization | Fuzzy Set
9. 1. Text Summarization | Fuzzy Set
10. 1. Text Summarization | Fuzzy Set
11. 1. Text Summarization | Calculating the score of sentences
12. 1. Text Summarization | Korean Text Summarization
http://summ-dev.ap-northeast-2.elasticbeanstalk.com/
13. 1. Text Summarization | Patent, 2013
https://goo.gl/blkjwf
Korean News Summarization System And Method
14. 2. Text Clustering
Text Clustering
15. 2. Text Clustering
Documents -> Convert -> Matrix

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

1. Select a matrix
2. Calculate the similarity between the columns of the matrix
3. Cluster by the degree of similarity

A. Basic (using raw matrix)
B. LSI (Latent Semantic Indexing)
C. NMF (Non-negative Matrix Factorization)
16. 2. Text Clustering | Term-Frequency Matrix
d1 = { apple, banana, kiwi }
d2 = { apple, banana, store }
d3 = { store }

Term-Frequency Matrix: $A = \begin{bmatrix} d_1 & \cdots & d_n \end{bmatrix}$, where each column $d_j$ is a document vector whose entries are term frequencies.
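A minimal sketch of building the term-frequency matrix for the three example documents d1, d2, d3. The function name and the alphabetical ordering of terms are assumptions for illustration; rows are terms and columns are documents, so each column is one document vector. Raw counts are used here; any normalization is left out.

```python
from collections import Counter

def term_frequency_matrix(docs):
    """Return (terms, doc_names, matrix) with one row per term and one column per document."""
    terms = sorted({term for words in docs.values() for term in words})
    names = list(docs)
    counts = {name: Counter(words) for name, words in docs.items()}
    # Counter returns 0 for terms that do not occur in a document.
    matrix = [[counts[name][term] for name in names] for term in terms]
    return terms, names, matrix

docs = {
    "d1": ["apple", "banana", "kiwi"],
    "d2": ["apple", "banana", "store"],
    "d3": ["store"],
}
terms, names, A = term_frequency_matrix(docs)
for term, row in zip(terms, A):
    print(term, row)
```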
18. 2. Text Clustering | Latent Semantic Indexing (LSI)
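A minimal sketch of LSI as a truncated singular value decomposition of the term-document matrix, using NumPy. The function name, the choice k = 2, and the small example matrix (raw counts for three documents over the terms apple, banana, kiwi, store) are assumptions for illustration, not the configuration used in the experiments.

```python
import numpy as np

def lsi_document_vectors(A, k):
    """Project the columns (documents) of a term-document matrix onto k latent dimensions."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values; each column of the result is one document.
    return np.diag(s[:k]) @ Vt[:k, :]

# Rows: apple, banana, kiwi, store. Columns: d1, d2, d3.
A = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
], dtype=float)

docs_2d = lsi_document_vectors(A, k=2)
print(docs_2d.shape)
```

Similarity is then computed between the columns of `docs_2d` instead of the columns of `A`, which is what makes the method usable on matrices with many terms.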
19. 2. Text Clustering | Non-negative Matrix Factorization (NMF)
20. 2. Text Clustering | Non-negative Matrix Factorization (NMF)
[Figure: NMF example with labels Doc1 to Doc3, Feature 1 and Feature 2, and Term 1 to Term 5]
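A minimal sketch of NMF using the multiplicative update rules, one common algorithm for this factorization. The slides do not state which algorithm was used, so the update rule, the function name, the iteration count, and the example matrix are all assumptions for illustration. The matrix A (terms by documents) is approximated by W (terms by features) times H (features by documents), with all entries non-negative.

```python
import numpy as np

def nmf(A, k, iterations=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (m x n) into W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Rows: four terms. Columns: three documents.
A = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
], dtype=float)

W, H = nmf(A, k=2)
print(np.round(W @ H, 2))
```

Each column of `H` is a two-feature representation of one document and can be used in place of the raw document vector when computing similarity.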
21. 2. Text Clustering | Term-Frequency Matrix
Large-dimension matrix for a large-scale document set
Proposed method: Syllable Vector
22. 2. Text Clustering | Syllable-n Vector
about 1,200 dimensions
23. 2. Text Clustering | Dimension reduction using Syllable-n vector
Dimension Reduction by Syllable Vector
Syllable-1, Syllable-2, Syllable-3
24. 2. Text Clustering | Syllable-n-All Vector
Syllable-1-All Syllable-2-All
, , , , , , , ,
Take all combinations of syllable-n: a word $w_j$ of length $l_j$ (in syllables) yields $\binom{l_j}{n}$ combinations.
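A minimal sketch of taking all combinations of n syllables from a word. It relies on the fact that each precomposed Hangul syllable block is a single Unicode character, so a Python string can be treated as a sequence of syllables. The function name, the choice to keep the original syllable order inside each combination, and the example word are assumptions for illustration.

```python
from itertools import combinations
from math import comb

def syllable_n_all(word, n):
    """Return every order-preserving combination of n syllables taken from word."""
    return ["".join(chosen) for chosen in combinations(word, n)]

word = "맥도날드"  # four syllables
features = syllable_n_all(word, 2)
print(features)
print(len(features) == comb(len(word), 2))
```

The number of features per word matches the binomial coefficient, and the feature vocabulary is drawn from syllable combinations rather than from whole words.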
25. 2. Text Clustering | Benchmark Dataset HKIB-20000
Dimension reduction
How about information loss?
26. 2. Text Clustering | Similarity
[Figure: vectors a and b]

$$\mathrm{sim}(d_1, d_2) = \sqrt{2\left(1 - \frac{2/9}{\sqrt{3/9}\,\sqrt{3/9}}\right)} = 0.8164$$
$$\mathrm{sim}(d_2, d_3) = 0.919$$
$$\mathrm{sim}(d_1, d_3) = 1.414$$
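A minimal sketch that reproduces the three values for d1 = { apple, banana, kiwi }, d2 = { apple, banana, store }, d3 = { store }. The numbers are consistent with the measure sqrt(2 (1 - cos θ)), which equals the Euclidean distance between the two length-normalized document vectors, so smaller values mean more similar documents. Dividing each entry by the document length (giving entries of 1/3) is an assumption inferred from the 2/9 and 3/9 terms; the function name is hypothetical.

```python
import math

def sim(a, b):
    """sqrt(2 * (1 - cosine)); 0 means identical direction, sqrt(2) means no shared terms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

# Term order: apple, banana, kiwi, store. Entries are counts divided by document length.
d1 = [1/3, 1/3, 1/3, 0]
d2 = [1/3, 1/3, 0, 1/3]
d3 = [0, 0, 0, 1]

print(sim(d1, d2), sim(d2, d3), sim(d1, d3))
```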
27. 2. Text Clustering | Similarity
Source: Doc Number 5222
Target: All other documents
28. 2. Text Clustering | Correlation
Basic
LSI
NMF
29. 2. Text Clustering | Evaluation of Text Clustering
30. 2. Text Clustering | Precision
[Figure: sets labeled Real and Answer, with regions TP and FP]

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{5}{7} = 0.71$$
31. 2. Text Clustering | Evaluation Set
Doc 1
Doc 2
Doc 3
Doc 4
Doc 5
Doc 6
…
32. 2. Text Clustering | Standard for Evaluation
[Figure: points numbered 1 to 5]
• Nearest neighbors
• Limited Radius
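A minimal sketch of the two evaluation standards and of precision. Retrieval by nearest neighbors keeps the n closest documents (a count threshold), and retrieval by limited radius keeps every document within a given distance (a radius threshold). The function names, the distances, and the set of relevant documents are made-up values for illustration.

```python
def retrieve_by_count(distances, n):
    """Indices of the n nearest documents (count threshold)."""
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    return ranked[:n]

def retrieve_by_radius(distances, radius):
    """Indices of all documents whose distance is within the radius (radius threshold)."""
    return [i for i, d in enumerate(distances) if d <= radius]

def precision(retrieved, relevant):
    """TP / (TP + FP): the share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    true_positives = sum(1 for i in retrieved if i in relevant)
    return true_positives / len(retrieved)

# Hypothetical distances from a source document to seven other documents.
distances = [0.2, 0.9, 0.4, 1.3, 0.5, 0.7, 1.1]
relevant = {0, 2, 4, 5, 6}

by_count = retrieve_by_count(distances, n=5)
by_radius = retrieve_by_radius(distances, radius=1.4)
print(precision(by_count, relevant), precision(by_radius, relevant))
```

With these made-up numbers the radius threshold retrieves all seven documents, five of which are relevant, giving a precision of 5/7.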
33. 2. Text Clustering | Evaluation of text clustering
[Figure: Radius Threshold results for Syl-2, Syl-3, and Word]
34. 2. Text Clustering | Evaluation of text clustering
Count Threshold
35. 2. Text Clustering | Evaluation of text clustering
[Figure: two panels, Precision (LSI) and Speed (LSI), for n = 5 with count threshold]
36. 2. Text Clustering | Evaluation of text clustering
Syl-2 for LSI is BEST!
37. 2. Text Clustering | Patent
https://goo.gl/fskHxT
Korean Text Clustering System and Method
38. 2. Text Clustering | Limitation of word-based method
These words are NOT important for understanding the given text!
39. 3. Learning of Text Relationship
Word-based
Citation Relation
Find similar documents using citation information
40. 3. Learning of Text Relationship | Natural Language Processing (NLP)
https://www.upwork.com/hiring/for-clients/artificial-intelligence-and-natural-language-processing-in-big-data/
41. 3. Learning of Text Relationship | Word2Vec
2013, Hot Model in NLP
“Word2Vec” (Google)
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
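Source Text -> Training Set
Each pair is (target keyword, context keyword); on the slide, red marks the target keyword and blue marks the context keyword.
(맥도날드가, 햄버거는)
(맥도날드가, 맛있다.)
(맛있다., 맥도날드가)
(맛있다., 감자튀김도)
(감자튀김도, 맛있다.)
(감자튀김도, 맛있었는데..)
(맘스터치도, 햄버거는)
(맘스터치도, 맛있다.)
(맛있다., 맘스터치도)
(맛있다., 패티가)
Glosses: 맥도날드가 = "McDonald's" (subject), 햄버거는 = "hamburgers" (topic), 맛있다. = "is delicious.", 감자튀김도 = "French fries, too", 맛있었는데.. = "was delicious, but..", 맘스터치도 = "Mom's Touch, too", 패티가 = "the patty" (subject).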
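A minimal sketch of how (target, context) training pairs of this kind can be generated with a skip-gram window. The source text itself is not reproduced here, so the token sequence below is an assumption chosen to be consistent with the listed pairs, and the window size of 1 is likewise inferred from them. The function name is hypothetical.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every token and each neighbor within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Assumed whitespace-split tokens; postpositions stay attached to the words.
tokens = ["햄버거는", "맥도날드가", "맛있다.", "감자튀김도", "맛있었는데.."]
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

Because the tokens are split on whitespace only, forms such as 맥도날드가 and 맘스터치도 keep their postpositions, which is one reason a purely word-based model treats them as unrelated vocabulary items.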
42. 3. Learning of Text Relationship | Word2Vec
(Input, Output)
(맥도날드가, 햄버거는) = (McDonald's, hamburgers)
(맥도날드가, 맛있다.) = (McDonald's, is delicious.)
43. 3. Learning of Text Relationship | Word2Vec
44. 3. Learning of Text Relationship | Word2Vec
Shortcomings of Word2Vec
• Word-based method only
=> Meaningless words are also counted.
• Input and output must share the same vocabulary set
=> The dimensions of the input and output are fixed.
• Uses only the context information of the target word
=> Depends entirely on the context within a window of size N.
45. 3. Learning of Text Relationship | Heterogeneous Word2Vec
Heterogeneous Word2Vec
Input Output
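A minimal sketch of a Word2Vec-style network whose input and output vocabularies differ in size, which is the idea the name Heterogeneous Word2Vec suggests. This is not the author's implementation: the function names, the softmax with cross-entropy loss, the plain SGD training loop, the hyperparameters, and the toy pairs (5 input items, 7 output items) are all assumptions for illustration. W1 maps a one-hot input to a hidden vector and W2 maps the hidden vector to scores over the output vocabulary.

```python
import numpy as np

def train_two_vocab_model(pairs, n_in, n_out, dim=3, lr=0.2, epochs=300, seed=0):
    """Train W1 (n_in x dim) and W2 (dim x n_out) on (input_index, output_index) pairs."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(n_in, dim))
    W2 = rng.normal(scale=0.1, size=(dim, n_out))
    for _ in range(epochs):
        for i, o in pairs:
            h = W1[i].copy()            # hidden layer for a one-hot input is row i of W1
            z = h @ W2                  # scores over the output vocabulary
            p = np.exp(z - z.max())
            p /= p.sum()                # softmax probabilities
            p[o] -= 1.0                 # gradient of cross-entropy w.r.t. the scores
            grad_h = W2 @ p
            W2 -= lr * np.outer(h, p)
            W1[i] -= lr * grad_h
    return W1, W2

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy relations: inputs 0 and 1 point to outputs 0 and 1; inputs 3 and 4 point to outputs 5 and 6.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 5), (3, 6), (4, 5), (4, 6)]
W1, W2 = train_two_vocab_model(pairs, n_in=5, n_out=7)
print(cosine(W1[0], W1[1]), cosine(W1[0], W1[3]))
```

Inputs that share outputs end up with similar rows in `W1` even though no text is involved, only the relations between the two vocabularies.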
46. 3. Learning of Text Relationship | Heterogeneous Word2Vec
47. 3. Learning of Text Relationship | Heterogeneous Word2Vec
[Figure: nodes labeled 1, 2, 3, 4, 5 and 1, 2, 3, 4, 5, 0, 6]
48. 3. Learning of Text Relationship | Heterogeneous Word2Vec
[Figure: input and output vectors with 0/1 entries]
49. 3. Learning of Text Relationship | Heterogeneous Word2Vec
[Figure: input and output vectors with 0/1 entries]
50. 3. Learning of Text Relationship | Heterogeneous Word2Vec
Ch 4. Learning for number relationship
[Figure: nodes labeled 1, 2, 3, 4, 5 and 1, 2, 3, 4, 5, 0, 6]
51. 3. Learning of Text Relationship | Heterogeneous Word2Vec
Ch 4. Learning for number relationship
[Figure: nodes labeled 1, 2, 3, 4, 5 and 1, 2, 3, 4, 5, 0, 6]
52. 3. Learning of Text Relationship | Heterogeneous Word2Vec
[Figure: Matrix (Vectors) and pairwise Similarity ( 0 is best ) for items 1 to 5]
53. 3. Learning of Text Relationship | Heterogeneous Word2Vec
54. 3. Learning of Text Relationship | Heterogeneous Word2Vec
Input Output
55. 3. Learning of Text Relationship | Law2Vec
Legal information consists mainly of legislation and cases.
• CL ( Case - Legislation )
• CC ( Case - Case )
• CLC ( Case - Legislation, Case )
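A minimal sketch of how training pairs for the three models could be assembled from citation data: CL pairs a case with each legislation it cites, CC pairs a case with each case it cites, and CLC uses both. The data layout, the identifiers, and the function name are hypothetical and only illustrate the structure; no real legal records are used.

```python
def build_pairs(citations, model):
    """Return (case, cited_item) pairs for model 'CL', 'CC', or 'CLC'."""
    if model not in ("CL", "CC", "CLC"):
        raise ValueError("model must be 'CL', 'CC', or 'CLC'")
    pairs = []
    for case, cited in citations.items():
        if model in ("CL", "CLC"):
            # Tag the item type so legislation and case identifiers cannot collide.
            pairs += [(case, ("legislation", item)) for item in cited["legislation"]]
        if model in ("CC", "CLC"):
            pairs += [(case, ("case", item)) for item in cited["cases"]]
    return pairs

# Hypothetical citation records.
citations = {
    "case_A": {"legislation": ["act_1", "act_2"], "cases": ["case_B"]},
    "case_B": {"legislation": ["act_2"], "cases": []},
}

print(build_pairs(citations, "CL"))
print(build_pairs(citations, "CC"))
print(len(build_pairs(citations, "CLC")))
```

In the CL and CLC settings the input vocabulary (cases) differs from the output vocabulary, which is why a model with separate input and output vocabularies is needed.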
56. 3. Learning of Text Relationship | Law2Vec CL Model, CC Model
• CL Model: Case and Cited legislations
• CC Model: Case and Cited cases
57. 3. Learning of Text Relationship | Law2Vec CLC Model
• CLC Model: Case and Cited legislations, Cited cases
58. 3. Learning of Text Relationship | Evaluation of Law2Vec
59. 3. Learning of Text Relationship | Evaluation of CL Model
Cited legislations
Case
60. 3. Learning of Text Relationship | Evaluation of Law2Vec : W_1
61. 3. Learning of Text Relationship | Evaluation of Law2Vec : W_2
62. 3. Learning of Text Relationship | Evaluation of CC Model
Cited cases
Case
63. 3. Learning of Text Relationship | Evaluation of CC Model
64. 3. Learning of Text Relationship | Evaluation of CLC Model
Cited legislations Cited cases
Case
65. 3. Learning of Text Relationship | Evaluation of CLC Model
66. 3. Learning of Text Relationship | Result of Law2Vec
67. 3. Learning of Text Relationship | Result of Law2Vec
68. 3. Learning of Text Relationship | Expansion of Data set
< Lawyer Oh’s Answer Sheet >
69. 3. Learning of Text Relationship | Law2Vec for Sample Data
[Figure: six panels showing the CL Model, CC Model, and CLC Model, each at Iteration 10000 and Iteration 60000]
70. 3. Learning of Text Relationship | Link Prediction
71. Conclusion
1. Main Contribution
• Korean Language Feature V2 (JKB, JX)
• Syllable Vector
• Heterogeneous Word2Vec ( Law2Vec )
2. Advantage
Chapter 2. Text Summarization
To summarize using linguistic features for Korean.
Chapter 3. Text Clustering
To reduce dimension with a small amount of information loss using the Syllable Vector, and to compute efficiently on large-scale document sets.
Chapter 4. Text Relational Learning
To learn heterogeneous data by using the relationships between them, without text (word) data.
72. Conclusion
3. Interest to reader
Chapter 2. Text Summarization
To apply the Fuzzy Concept to text mining, considering language features
=> Define your idea and apply it to a system easily
Chapter 3. Text Clustering
The Korean language is more efficient for large-scale document sets
=> The Korean language is adequate for compressing text data
Chapter 4. Text Relational Learning
Design the NN system to fit the structure of your data
=> Meta-data is good enough material to learn the relationships between them.