Kyunghoon Kim
UNIST
Department of Mathematical Sciences
December 12, 2017
kyunghoon@unist.ac.kr
A Mathematical Measurement for Korean Text mining
and its applications
Difficulty of Korean Language
Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 2 / 83
• New Concepts

- Korean Alphabet ( , , , , …)

- End of a word ( ) ( , , , …)

- Postposition ( ) ( , , , , …)

- Word order ( ) (SOV, …)

- …
Language Destruction
Outline
Korean Text Mining
1. Text Summarization (2013): Korean Language Feature V2; Fuzzy System
2. Text Clustering (2015): Syllable Vector; Term-Frequency Matrix (LSI, NMF)
3. Learning of Text Relationship (2017): Heterogeneous Word2Vec (Law2Vec); Artificial Neural Network (Word2Vec)
1. Text Summarization | Motivation
< Raw News Article > < Summarized News >
March, 2013
How about Korean?
News Article
Summarized
Sentences
1. Text Summarization | Process
Document -> Preprocessing -> Feature Selection -> Scoring by Model -> Refinement & Sorting by score
NNP,*,T, ,*,*,*,*
JKB,*,F, ,*,*,*,*

NNG,*,T, ,*,*,*,*
NNG,*,T, ,*,*,*,*
JC,*,F, ,*,*,*,*

NNG,*,T, ,*,*,*,*
XSN,*,T, ,*,*,*,*

NNG,*,T, ,*,*,*,*
NNG,*,T, ,*,*,*,*
XSN,*,T, ,*,*,*,*

NNG,*,F, ,*,*,*,*
JC,*,F, ,*,*,*,*

NNG,*,T, ,*,*,*,*
NNG,*,F, ,Compound,*,*, /NNG/*+ /NNG/*
JKS,*,F, ,*,*,*,*

MAG, / ,F, ,*,*,*,*

VV,*,F, ,*,*,*,*

EC,*,F, ,*,*,*,*

VX,*,T, ,*,*,*,*

EF,*,F, ,*,*,*,*

. SF,*,*,*,*,*,*,*
• Content word(Keyword) feature
• Title word feature
• Sentence location feature
• Sentence Length feature
• Proper Noun feature
• Upper-case word feature
• Cue-Phrase feature
• Biased Word feature
• Font based feature
• Pronouns
• Sentence-to-Sentence Cohesion
• Sentence-to-Centroid Cohesion
• Occurrence of non-essential information
• Discourse analysis
Only for English features
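As a rough illustration of how a few of the listed features might be turned into numbers, the sketch below computes title-word, sentence-location, and sentence-length values for each sentence. The function name `sentence_features`, the whitespace tokenization, and the exact formulas are assumptions made for this example; the slides do not give the definitions actually used.

```python
def sentence_features(sentences, title):
    """Return one dict of feature values in [0, 1] per sentence.

    All three definitions are illustrative assumptions, not the slides' formulas.
    """
    title_words = set(title.split())
    max_len = max(len(s.split()) for s in sentences)
    n = len(sentences)
    rows = []
    for i, sentence in enumerate(sentences):
        words = sentence.split()
        rows.append({
            # Share of title words that also occur in the sentence.
            "title_word": len(title_words & set(words)) / max(len(title_words), 1),
            # Earlier sentences get a higher location value.
            "location": 1.0 - i / n,
            # Length relative to the longest sentence.
            "length": len(words) / max_len,
        })
    return rows


sentences = ["apple store opens", "the store sells apple banana kiwi", "done"]
features = sentence_features(sentences, title="apple store")
```

For Korean text, the whitespace split would presumably be replaced by the morphological analysis shown above, so that postpositions and endings do not count as separate keywords.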
1. Text Summarization | Feature measurements
Feature based on English
Feature based on Korean
, , , , , ... , , , , ...
1. Text Summarization | Fuzzy Set
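The figures for these slides did not survive extraction. As a generic reminder of what a fuzzy set is, the sketch below defines a triangular membership function and the standard min/max operators. The triangular shape, the parameter names `a`, `b`, `c`, and the helper names are assumptions for illustration, not the membership functions used in the talk.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set (requires a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)   # rising edge
    return (c - x) / (c - b)       # falling edge


def fuzzy_and(*degrees):
    """Standard fuzzy intersection: minimum of the membership degrees."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Standard fuzzy union: maximum of the membership degrees."""
    return max(degrees)


# A feature value of 0.25 belongs to the "medium" set (peak at 0.5) with degree 0.5.
medium = triangular(0.25, 0.0, 0.5, 1.0)
```

A sentence feature in [0, 1] can then belong partly to several sets (for example "low" and "medium") instead of being forced into one crisp category.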
1. Text Summarization | Calculating the score of sentences
1. Text Summarization | Korean Text Summarization
http://summ-dev.ap-northeast-2.elasticbeanstalk.com/
1. Text Summarization | Patent, 2013
https://goo.gl/blkjwf
Korean News Summarization System And Method
2. Text Clustering
Documents -> Convert -> Matrix

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
$$

1. Select Matrix
A. Basic (using raw matrix)
B. LSI (Latent Semantic Indexing)
C. NMF (Non-negative Matrix Factorization)
2. Calculating similarity between each column of matrix
3. Clustering by the degree of similarity
2. Text Clustering | Term-Frequency Matrix
d1 = { apple, banana, kiwi }
d2 = { apple, banana, store }
d3 = { store }

$$
A = \begin{bmatrix} d_1 & \cdots & d_n \end{bmatrix}
$$
Term-Frequency Matrix
Frequency
Document vector
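The three example documents can be turned into a term-frequency matrix as follows. This is a minimal sketch: the alphabetical row order and the variable names `docs`, `terms`, and `A` are choices made for this example.

```python
import numpy as np

docs = {
    "d1": ["apple", "banana", "kiwi"],
    "d2": ["apple", "banana", "store"],
    "d3": ["store"],
}

# One row per term (sorted alphabetically), one column per document.
terms = sorted({t for words in docs.values() for t in words})
A = np.array([[words.count(t) for words in docs.values()] for t in terms])
```

Each column of `A` is the document vector of one document, and each entry is the frequency of a term in that document.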
2. Text Clustering | Singular Value Decomposition (SVD)
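Rows are the terms w1, ..., w5; columns are the documents d1, d2, d3.

$$
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
1 & 1 & 1 \\
1 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}
=
\begin{pmatrix}
-0.27 & 0.21 & 0.70 & -0.53 & 0.30 \\
-0.27 & 0.21 & -0.70 & -0.53 & 0.30 \\
-0.71 & -0.33 & 0 & -0.10 & -0.60 \\
-0.55 & 0.43 & 0 & 0.64 & 0.29 \\
-0.15 & -0.77 & 0 & 0.10 & 0.60
\end{pmatrix}
\begin{pmatrix}
2.35 & 0 & 0 \\
0 & 1.19 & 0 \\
0 & 0 & 1.00 \\
0 & 0 & 0 \\
0 & 0 & 0
\end{pmatrix}
\begin{pmatrix}
-0.65 & 0.26 & 0.70 \\
-0.65 & 0.26 & -0.70 \\
-0.36 & -0.92 & 0
\end{pmatrix}^{T}
$$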
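The decomposition of the 5 x 3 example matrix can be checked numerically. The sketch below uses NumPy's `np.linalg.svd`; the signs of the singular vectors it returns may differ from those on the slide, since singular vectors are only determined up to sign.

```python
import numpy as np

# Term-document matrix from the slide: rows w1..w5, columns d1..d3.
A = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

# Full SVD: U is 5 x 5, s holds the 3 singular values, Vt is 3 x 3.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed the singular values in a 5 x 3 matrix and rebuild A = U Sigma V^T.
Sigma = np.zeros_like(A)
Sigma[:3, :3] = np.diag(s)
A_rebuilt = U @ Sigma @ Vt
```

The singular values come out as roughly 2.36, 1.20, and 1.00, in line with the 2.35, 1.19, and 1.00 shown in the middle matrix.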
2. Text Clustering | Latent Semantic Indexing (LSI)
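LSI keeps only the k largest singular values of the term-document matrix and represents each document in the resulting k-dimensional space. The following is a minimal sketch of that idea, not the talk's implementation; the function name `lsi_document_vectors` and the choice k = 2 are assumptions for illustration.

```python
import numpy as np


def lsi_document_vectors(A, k):
    """Return one k-dimensional row per document (column of A) via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Document coordinates in the latent space: Sigma_k V_k^T, transposed.
    return (np.diag(s[:k]) @ Vt[:k, :]).T


A = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

doc_vectors = lsi_document_vectors(A, k=2)
```

Similarities are then computed between the rows of `doc_vectors` instead of between the full columns of `A`.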
2. Text Clustering | Non-negative Matrix Factorization (NMF)
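[Figure: a term-document matrix (Term 1 to Term 5 by Doc1 to Doc3) factorized into a term-feature matrix (Term 1 to Term 5 by Feature 1, Feature 2) and a feature-document matrix (Feature 1, Feature 2 by Doc1 to Doc3)]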
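A non-negative factorization of a term-document matrix into two features can be sketched with the classic multiplicative update rules of Lee and Seung. The talk does not say which NMF algorithm was used, so the update rule, the function name `nmf`, the iteration count, and the random initialization are all assumptions for this example.

```python
import numpy as np


def nmf(A, k, iters=500, seed=0, eps=1e-9):
    """Approximate A (m x n, non-negative) as W @ H with W (m x k), H (k x n) >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Term-document matrix: 5 terms by 3 documents.
A = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

W, H = nmf(A, k=2)
error = np.linalg.norm(A - W @ H)
```

Here `W` plays the role of the term-feature matrix and `H` the feature-document matrix; the columns of `H` are the reduced document vectors.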
2. Text Clustering | Term-Frequency Matrix
Large-dimension matrix for a large-scale document set
Proposed method: Syllable Vector
2. Text Clustering | Syllable-n Vector
about 1,200 dimensions
2. Text Clustering | Dimension reduction using Syllable-n vector
Dimension Reduction
by Syllable Vector
Syllable-1 Syllable-2 Syllable-3
2. Text Clustering | Syllable-n-All Vector
Syllable-1-All Syllable-2-All
, , , , , , , ,
Take all combinations of syllable-n: a word $w_j$ of length $l_j$ yields $\binom{l_j}{n}$ combinations.
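The count of combinations per word can be illustrated with `itertools.combinations`. This sketch assumes that each Hangul syllable is one precomposed Unicode character and that combinations keep the original syllable order without requiring adjacency; whether the talk's Syllable-n-All vector is defined exactly this way is not stated. The function name `syllable_n_all` and the sample word are chosen for this example.

```python
from itertools import combinations
from math import comb


def syllable_n_all(word, n):
    """All order-preserving combinations of n syllables taken from the word."""
    return ["".join(c) for c in combinations(word, n)]


word = "맥도날드가"            # 5 syllables
pairs = syllable_n_all(word, 2)
expected_count = comb(len(word), 2)
```

For a five-syllable word and n = 2 this gives 10 combinations, which matches the binomial count above.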
2. Text Clustering | Benchmark Dataset HKIB-20000
Dimension reduction
How about information loss?
2. Text Clustering | Similarity
θ: angle between vectors a and b

$$
\mathrm{sim}(d_1, d_2) = \sqrt{2\left(1 - \frac{2/9}{\sqrt{3/9}\,\sqrt{3/9}}\right)} = 0.8164
$$

sim(d2, d3) = 0.919
sim(d1, d3) = 1.414
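The three values can be reproduced from the example documents d1 = { apple, banana, kiwi }, d2 = { apple, banana, store }, d3 = { store }. The sketch below assumes the measure is sqrt(2(1 - cos θ)), which is consistent with all three numbers on the slide; the term order (apple, banana, kiwi, store) and the function name `sim` are choices made for this example.

```python
import numpy as np


def sim(a, b):
    """sqrt(2 * (1 - cos(theta))): 0 for identical directions, sqrt(2) for orthogonal."""
    cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against tiny negative values caused by rounding.
    return float(np.sqrt(np.clip(2.0 * (1.0 - cos), 0.0, None)))


# Term order: apple, banana, kiwi, store.
d1 = np.array([1.0, 1.0, 1.0, 0.0])
d2 = np.array([1.0, 1.0, 0.0, 1.0])
d3 = np.array([0.0, 0.0, 0.0, 1.0])

s12, s23, s13 = sim(d1, d2), sim(d2, d3), sim(d1, d3)
```

Because the cosine is scale-invariant, using raw counts instead of the normalized frequencies 2/9 and 3/9 gives the same result. Smaller values mean more similar documents.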
2. Text Clustering | Similarity
Source: document number 5222
Target: all other documents
2. Text Clustering | Correlation
Basic
LSI
NMF
2. Text Clustering | Evaluation of Text Clustering
2. Text Clustering | Precision
Real
Answer
TP
FP
$$
\text{Precision} = \frac{TP}{TP + FP} = \frac{5}{7} = 0.71
$$
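The same calculation in code, as a small sketch: the document IDs are made up so that 5 of the 7 returned documents are correct, matching the slide's 5/7.

```python
def precision(retrieved, relevant):
    """TP / (TP + FP): share of retrieved items that are truly relevant."""
    retrieved = set(retrieved)
    true_positives = len(retrieved & set(relevant))
    return true_positives / len(retrieved) if retrieved else 0.0


# Hypothetical IDs: 7 documents returned, 5 of them in the real answer set.
retrieved = [1, 2, 3, 4, 5, 6, 7]
relevant = [1, 2, 3, 4, 5, 8, 9]
p = precision(retrieved, relevant)
```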
2. Text Clustering | Evaluation Set
Doc 1
Doc 2
Doc 3
Doc 4
Doc 5
Doc 6
…
2. Text Clustering | Standard for Evaluation
Nearest neighbors (ranked 1 to 5)
Limited radius
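The two selection standards can be sketched as follows: either take the n closest documents (a count threshold) or take every document within a fixed distance (a radius threshold). The function names and the sample distances are assumptions for illustration.

```python
import numpy as np


def nearest_neighbors(distances, n):
    """Indices of the n documents with the smallest distance to the source."""
    return np.argsort(distances, kind="stable")[:n].tolist()


def within_radius(distances, radius):
    """Indices of all documents whose distance is at most the radius."""
    return np.flatnonzero(distances <= radius).tolist()


# Hypothetical distances from one source document to four targets.
distances = np.array([0.9, 0.2, 1.4, 0.5])
by_count = nearest_neighbors(distances, n=2)
by_radius = within_radius(distances, radius=0.9)
```

The count threshold always returns exactly n documents, while the radius threshold returns a variable number depending on how crowded the neighborhood is.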
2. Text Clustering | Evaluation of text clustering
Radius Threshold
Syl-2
Syl-3
Word
2. Text Clustering | Evaluation of text clustering
Count Threshold
2. Text Clustering | Evaluation of text clustering
Precision Speed
n = 5 , count threshold
LSI LSI
2. Text Clustering | Evaluation of text clustering
Syl-2 with LSI is best!
2. Text Clustering | Patent
https://goo.gl/fskHxT
Korean Text Clustering System and Method
2. Text Clustering | Limitation of word-based method
These words are NOT important for understanding the given text!
3. Learning of Text Relationship
Word-based -> Citation Relation
Find similar documents using citation information
3. Learning of Text Relationship | Natural Language Processing (NLP)
https://www.upwork.com/hiring/for-clients/artificial-intelligence-and-natural-language-processing-in-big-data/
3. Learning of Text Relationship | Word2Vec
2013, Hot Model in NLP
“Word2Vec” (Google)
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
(맥도날드가, 햄버거는)
(맥도날드가, 맛있다.)
(맛있다., 맥도날드가)
(맛있다., 감자튀김도)
(감자튀김도, 맛있다.)
(감자튀김도, 맛있었는데..)
(맘스터치도, 햄버거는)
(맘스터치도, 맛있다.)
(맛있다., 맘스터치도)
(맛있다., 패티가)
Source Text
Red : Target keyword, Blue : Context Keyword
Training Set
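The (target, context) pairs above are what a skip-gram model with a window of one word on each side would produce. The sketch below generates such pairs; the source sentence did not survive extraction, so the token order used here is an assumption chosen to be consistent with the first six pairs listed, and the function name `skipgram_pairs` is made up for this example.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every token and its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Assumed token order (whitespace-separated Korean words, postpositions attached).
tokens = ["햄버거는", "맥도날드가", "맛있다.", "감자튀김도", "맛있었는데.."]
pairs = skipgram_pairs(tokens, window=1)
```

Note that tokens such as 맥도날드가 carry a postposition, so the same noun with a different postposition becomes a different vocabulary entry in a purely word-based model.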
3. Learning of Text Relationship | Word2Vec
(맥도날드가, 햄버거는)
(맥도날드가, 맛있다.)
Input, Output
3. Learning of Text Relationship | Word2Vec
Shortcomings of Word2Vec
• Word-based only
=> Meaningless words are also counted.
• Input and output share the same vocabulary set
=> The dimensions of the input and output are fixed.
• Uses only the context information of the target word
=> Depends entirely on the context within window size N.
3. Learning of Text Relationship | Heterogeneous Word2Vec
Heterogeneous Word2Vec
Input Output
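One way to read "heterogeneous" is a skip-gram style network whose input and output layers index different item sets, so their sizes need not match. The sketch below is an assumption-laden illustration of that reading, not the architecture from the talk: the class name `HeteroSkipGram`, the sizes (5 inputs, 7 outputs, 3 hidden units), the full softmax, and plain SGD are all chosen for this example.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


class HeteroSkipGram:
    """One hidden layer; input items and output items come from different sets."""

    def __init__(self, n_in, n_out, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_in, dim))   # input embeddings
        self.W2 = rng.normal(scale=0.1, size=(dim, n_out))  # output weights

    def forward(self, i):
        """Probability distribution over output items for input item i."""
        return softmax(self.W1[i] @ self.W2)

    def train_pair(self, i, o, lr=0.1):
        """One SGD step on the cross-entropy loss for the pair (input i, output o)."""
        h = self.W1[i].copy()
        p = softmax(h @ self.W2)
        grad_z = p.copy()
        grad_z[o] -= 1.0                  # gradient w.r.t. the output logits
        grad_h = self.W2 @ grad_z         # computed before W2 is changed
        self.W2 -= lr * np.outer(h, grad_z)
        self.W1[i] -= lr * grad_h
        return float(-np.log(p[o]))


model = HeteroSkipGram(n_in=5, n_out=7, dim=3)
losses = [model.train_pair(0, 5) for _ in range(300)]
```

After training, the rows of `W1` serve as vectors for the input items, and items that point to similar outputs end up with similar vectors.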
Ch 4. Learning for number relationship
[Figures: number-relationship example with node labels 1 to 5 and 0 to 6, followed by one-hot vectors and two parenthesized matrices]
3. Learning of Text Relationship | Heterogeneous Word2Vec
[Figure: panels "Matrix (Vectors)" and "Similarity (0 is best)", each labeled 1 to 5]
3. Learning of Text Relationship | Heterogeneous Word2Vec
Input Output
3. Learning of Text Relationship | Law2Vec
Legal information consists mainly of legislation and cases.
• CL ( Case - Legislation )
• CC ( Case - Case )
• CLC ( Case - Legislation, Case )
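The three variants differ only in which citations of a case are used as training targets. The sketch below builds (case, cited item) pairs for each variant from citation metadata. The data layout, the identifiers, the function name `citation_pairs`, and the direction of the pairs (case as input, cited item as output) are assumptions for illustration; the slides do not spell them out.

```python
def citation_pairs(cases, model):
    """Build (case, cited item) training pairs for the CL, CC, or CLC variant."""
    if model not in ("CL", "CC", "CLC"):
        raise ValueError(f"unknown model: {model}")
    pairs = []
    for case_id, cited in cases.items():
        if model in ("CL", "CLC"):
            pairs += [(case_id, ("L", law)) for law in cited["legislation"]]
        if model in ("CC", "CLC"):
            pairs += [(case_id, ("C", other)) for other in cited["cases"]]
    return pairs


# Hypothetical citation metadata: which legislation and cases each case cites.
cases = {
    "case_1": {"legislation": ["law_a", "law_b"], "cases": ["case_2"]},
    "case_2": {"legislation": ["law_a"], "cases": []},
}

cl_pairs = citation_pairs(cases, "CL")
cc_pairs = citation_pairs(cases, "CC")
clc_pairs = citation_pairs(cases, "CLC")
```

No words from the case text are used: the pairs are built from citation relations alone.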
3. Learning of Text Relationship | Law2Vec CL Model, CC Model
CL Model: Case and Cited legislations
CC Model: Case and Cited cases
3. Learning of Text Relationship | Law2Vec CLC Model
CLC Model: Case, with Cited legislations and Cited cases
3. Learning of Text Relationship | Evaluation of Law2Vec
3. Learning of Text Relationship | Evaluation of CL Model
Cited legislations
Case
3. Learning of Text Relationship | Evaluation of Law2Vec : W_1
3. Learning of Text Relationship | Evaluation of Law2Vec : W_2
3. Learning of Text Relationship | Evaluation of CC Model
Cited cases
Case
3. Learning of Text Relationship | Evaluation of CLC Model
Cited legislations Cited cases
Case
3. Learning of Text Relationship | Result of Law2Vec
3. Learning of Text Relationship | Expansion of Data set
< Lawyer Oh’s Answer Sheet >
3. Learning of Text Relationship | Law2vec for Sample Data
[Figure panels: CL Model, CC Model, and CLC Model, each shown at iteration 10000 and iteration 60000]
3. Learning of Text Relationship | Link Prediction
Conclusion
1. Main Contribution
Korean Language Feature V2 (JKB, JX)
Syllable Vector
Heterogeneous Word2Vec (Law2Vec)
2. Advantage
Chapter 2. Text Summarization: summarizes using linguistic features of Korean.
Chapter 3. Text Clustering: reduces dimension with little information loss by using the Syllable vector, which makes computation efficient for large-scale document sets.
Chapter 4. Text Relational Learning: learns from heterogeneous data by using the relationships between them, without text (word) data.
3. Interest to reader
Chapter 2. Text Summarization: apply fuzzy concepts to text mining while considering language features
=> Define your idea and apply it to a system easily
Chapter 3. Text Clustering: the Korean language is more efficient for large-scale document sets
=> The Korean language is well suited to compressing text data
Chapter 4. Text Relational Learning: design the NN system to fit the structure of your data
=> Metadata is good enough material for learning the relationships between items.

Korean Text mining

  • 1.
    Kyunghoon Kim UNIST Department ofMathematical Sciences December 12, 2017 kyunghoon@unist.ac.kr A Mathematical Measurement for Korean Text mining and its applications
  • 2.
    Difficulty of KoreanLanguage Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 2 / 83 • New Concepts
 - Korean Alphabet ( , , , , …)
 - End of a word ( ) ( , , , …)
 - Postposition ( ) ( , , , , …)
 - Word order ( ) (SOV, …)
 - … Language Destruction
  • 3.
    Outline Kyunghoon Kim (UNIST)A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 3 / 83 1. Text Summarization Korean Text Mining 2. Text Clustering 3. Learning of Text Relationship Korean Language
 Feature V2 Syllable Vector Heterogeneous Word2Vec ( Law2Vec ) Fuzzy System Term-Frequency Matrix ( LSI, NMF ) Artificial Neural Network ( Word2Vec ) 2013’ 2015’ 2017’
  • 4.
    1. Text Summarization| Motivation Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 4 / 83 < Raw News Article > < Summarized News > March, 2013 How about Korean? News Article Summarized Sentences
  • 5.
    1. Text Summarization| Process Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 5 / 83 Document Preprocessing Feature Selection Scoring by Model Refinement & Sorting by score NNP,*,T, ,*,*,*,* JKB,*,F, ,*,*,*,* NNG,*,T, ,*,*,*,* NNG,*,T, ,*,*,*,* JC,*,F, ,*,*,*,* NNG,*,T, ,*,*,*,* XSN,*,T, ,*,*,*,* NNG,*,T, ,*,*,*,* NNG,*,T, ,*,*,*,* XSN,*,T, ,*,*,*,* NNG,*,F, ,*,*,*,* JC,*,F, ,*,*,*,* NNG,*,T, ,*,*,*,* NNG,*,F, ,Compound,*,*, /NNG/*+ / NNG/* JKS,*,F, ,*,*,*,* MAG, / ,F, ,*,*,*,* VV,*,F, ,*,*,*,* EC,*,F, ,*,*,*,* VX,*,T, ,*,*,*,* EF,*,F, ,*,*,*,* . SF,*,*,*,*,*,*,* • Content word(Keyword) feature • Title word feature • Sentence location feature • Sentence Length feature • Proper Noun feature • Upper-case word feature • Cue-Phrase feature • Biased Word feature • Font based feature • Pronouns • Sentence-to-Sentence Cohesion • Sentence-to-Centroid Cohesion • Occurrence of non-essential information • Discourse analysis Only for English features
  • 6.
    1. Text Summarization| Feature measurements Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 6 / 83 Feature based on English Feature based on Korean , , , , , ... , , , , ...
  • 7.
    1. Text Summarization| Fuzzy Set Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 7 / 83
  • 8.
    1. Text Summarization| Fuzzy Set Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 8 / 83
  • 9.
    1. Text Summarization| Fuzzy Set Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 9 / 83
  • 10.
    1. Text Summarization| Fuzzy Set Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 10 / 83
  • 11.
    1. Text Summarization| Calculating the score of sentences Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 11 / 83
  • 12.
    1. Text Summarization| Korean Text Summarization Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 12 / 83 http://summ-dev.ap-northeast-2.elasticbeanstalk.com/
  • 13.
    1. Text Summarization| Patent, 2013 Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 13 / 83 https://goo.gl/blkjwf Korean News Summarization System And Method
  • 14.
    2. Text Clustering KyunghoonKim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 14 / 83 Text Clustering
  • 15.
    2. Text Clustering KyunghoonKim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 15 / 83 MatrixDocuments 1. Select Matrix
 
 
 
 2. Calculating similarity
 between each column of matrix 3. Clustering by the degree of similarity A = 0 B B B @ a11 a12 ··· a1n a21 a22 ··· a2n ... ... ... ... am1 am2 ··· amn 1 C C C A Convert A. Basic (using raw matrix) B. LSI (Latent Semantic Indexing) C. NMF (Non-negative Matrix Factorization)
  • 16.
    2. Text Clustering| Term-Frequency Matrix Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 16 / 83 = { apple, banana, kiwi } = { apple, banana, store } = { store } d1 d2 d3 A = 2 4 d1 · · · dn 3 5 Term-Frequency Matrix Frequency Document vector
  • 17.
    2. Text Clustering| Singular Value Decomposition (SVD) Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 27 / 83 d1 d2 d3 w1 1 0 0 w2 0 1 0 w3 1 1 1 w4 1 1 0 w5 0 0 1 -0.27 0.21 0.70 -0.53 0.30 -0.27 0.21 -0.70 -0.53 0.30 -0.71 -0.33 0 -0.10 -0.60 -0.55 0.43 0 0.64 0.29 -0.15 -0.77 0 0.10 0.60 2.35 0 0 0 1.19 0 0 0 1.00 0 0 0 0 0 0 -0.65 0.26 0.70 -0.65 0.26 -0.70 -0.36 -0.92 0 =
  • 18.
    2. Text Clustering| Latent Semantic Indexing (LSI) Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 28 / 83
  • 19.
    2. Text Clustering| Non-negative Matrix Factorization (NMF) Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 31 / 83
  • 20.
    2. Text Clustering| Non-negative Matrix Factorization (NMF) Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 32 / 83 Doc1 Doc2 Doc3 Feature 1 Feature 2 Feature 1 Feature 2 Term 1 Term 2 Term 3 Term 4 Term 5
  • 21.
    2. Text Clustering| Term-Frequency Matrix Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 16 / 83 Large Dimension Matrix for large-scale set Proposed method Syllable Vector
  • 22.
    2. Text Clustering| Syllable-n Vector Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 17 / 83 about 1,200 dimension
  • 23.
    2. Text Clustering| Dimension reduction using Syllable-n vector Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 18 / 83 Dimension Reduction by Syllable Vector Syllable-1 Syllable-2 Syllable-3
  • 24.
    2. Text Clustering| Syllable-n-All Vector Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 19 / 83 Syllable-1-All Syllable-2-All , , , , , , , , ✓ lj n ◆ length of word wj Take all combination of syllable-n
  • 25.
    2. Text Clustering| Benchmark Dataset HKIB-20000 Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 23 / 83 Dimension reduction How about information loss?
  • 26.
    2. Text Clustering| Similarity Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 20 / 83 ✓ a b sim(d1, d2) = v u u t2 1 2/9 p 3/9 p 3/9 ! = 0.8164 sim(d2, d3) = 0.919 sim(d1, d3) = 1.414
  • 27.
    2. Text Clustering| Similarity Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 24 / 83 Source : Doc Number 5222 Target : Other all documents
  • 28.
    2. Text Clustering| Correlation Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 25 / 83 Basic LSI NMF
  • 29.
    2. Text Clustering| Evaluation of Text Clustering Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 34 / 83
  • 30.
    2. Text Clustering| Precision Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 35 / 83 Real Answer TP FP Precision = 5 7 = 0.71
  • 31.
    2. Text Clustering| Evaluation Set Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 36 / 83 Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 …
  • 32.
    2. Text Clustering| Standard for Evaluation Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 37 / 83 1 2 3 4 5 Nearest neighbors Limited Radius
  • 33.
    2. Text Clustering| Evaluation of text clustering Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 38 / 83 Radius Threshold Syl-2 Syl-3 Word
  • 34.
    2. Text Clustering| Evaluation of text clustering Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 39 / 83 Count Threshold
  • 35.
    2. Text Clustering| Evaluation of text clustering Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 40 / 83 Precision Speed n = 5 , count threshold LSI LSI
  • 36.
    2. Text Clustering| Evaluation of text clustering Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 41 / 83 Syl-2 for LSI is BEST!
  • 37.
    2. Text Clustering| Patent Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 42 / 83 https://goo.gl/fskHxTKorean Text Clustering System and Method
  • 38.
    2. Text Clustering| Limitation of word-based method Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 43 / 83 These words are NOT important to understand the given text! Limitation of word-based method
  • 39.
    Kyunghoon Kim (UNIST)A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 44 / 83 3. Learning of Text Relationship Word-based Citation Relation Find similar documents using citation information
  • 40.
    3. Learning ofText Relationship | Natural Language Processing (NLP) Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 45 / 83 https://www.upwork.com/hiring/for-clients/artificial-intelligence-and-natural-language-processing-in-big-data/
  • 41.
    3. Learning ofText Relationship | Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 46 / 83 2013, Hot Model in NLP “Word2Vec” (Google) http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/ (맥도날드가, 햄버거는) (맥도날드가, 맛있다.) (맛있다., 맥도날드가) (맛있다., 감자튀김도) (감자튀김도, 맛있다.) (감자튀김도, 맛있었는데..) (맘스터치도, 햄버거는) (맘스터치도, 맛있다.) (맛있다., 맘스터치도) (맛있다., 패티가) Source Text Red : Target keyword, Blue : Context Keyword Training Set
  • 42.
    3. Learning ofText Relationship | Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 47 / 83 (맥도날드가, 햄버거는) (맥도날드가, 맛있다.) Input, Output
  • 43.
    3. Learning ofText Relationship | Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 48 / 83
  • 44.
    3. Learning ofText Relationship | Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 50 / 83 Shortage of Word2Vec • Only Word-based Method
 => Meaningless words are also counted.
 • Only Same vocabulary set for input, output
 => Dimensions of input, output are fixed.
 • Only use a context information of target word
 => depends entirely on context with windows size N.
  • 45.
    3. Learning ofText Relationship | Heterogeneous Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 51 / 83 Heterogeneous Word2Vec Input Output
  • 46.
    3. Learning ofText Relationship | Heterogeneous Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 52 / 83
  • 47.
    3. Learning ofText Relationship | Heterogeneous Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 53 / 83 1 2 3 4 5 1 2 3 4 5 0 6
  • 48.
    3. Learning ofText Relationship | Heterogeneous Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 54 / 83 1 0 0 0 0 0 0 0 0 0 1 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B @ 1 C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C A 0 B B B B B B B B B B B B B B B B B B B B @ 1 C C C C C C C C C C C C C C C C C C C C A
  • 49.
    3. Learning ofText Relationship | Heterogeneous Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 55 / 83 1 0 0 0 0 0 1 0 0 0 0 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B @ 1 C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C A 0 B B B B B B B B B B B B B B B B B B B B @ 1 C C C C C C C C C C C C C C C C C C C C A
  • 50.
    Ch 4. Learningfor number relationship Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 56 / 83 3. Learning of Text Relationship | Heterogeneous Word2Vec 1 2 3 4 5 1 2 3 4 5 0 6
  • 51.
    Ch 4. Learningfor number relationship Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 57 / 83 3. Learning of Text Relationship | Heterogeneous Word2Vec 1 2 3 4 5 1 2 3 4 5 0 6
  • 52.
    3. Learning ofText Relationship | Heterogeneous Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 58 / 83 1 2 3 4 5 1 2 3 4 5 Similarity ( 0 is best ) Matrix (Vectors)
  • 53.
    3. Learning ofText Relationship | Heterogeneous Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 60 / 83
  • 54.
    3. Learning ofText Relationship | Heterogeneous Word2Vec Kyunghoon Kim (UNIST) A Mathematical Measurement for Korean Text mining and its applications Dec 12, 2017 62 / 83 Input Output
  • 55.
    3. Learning of Text Relationship | Law2Vec Legal information consists mainly of legislation and cases. • CL (Case - Legislation) • CC (Case - Case) • CLC (Case - Legislation, Case)
  • 56.
    3. Learning of Text Relationship | Law2Vec CL Model, CC Model (Figure labels: Cited legislations and Case for the CL model; Cited cases and Case for the CC model)
  • 57.
    3. Learning of Text Relationship | Law2Vec CLC Model (Figure labels: Cited legislations, Cited cases, Case)
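A minimal sketch of how citation metadata alone could be turned into training pairs for the CL, CC and CLC models. This is one plausible reading of the slides, not the author's implementation: the `cases` records, the identifiers in them, and the `training_pairs` helper are all hypothetical, and the transcript does not give the actual input/output encoding.

```python
# Hypothetical citation metadata: each case lists the legislation and
# the cases it cites. No text (word) data is involved.
cases = {
    "case_A": {"legislation": ["law_1", "law_2"], "cases": ["case_B"]},
    "case_B": {"legislation": ["law_2"], "cases": []},
    "case_C": {"legislation": ["law_1"], "cases": ["case_A", "case_B"]},
}

# Which citation fields each model reads (assumed from the model names).
MODEL_FIELDS = {
    "CL": ["legislation"],            # Case - Legislation
    "CC": ["cases"],                  # Case - Case
    "CLC": ["legislation", "cases"],  # Case - Legislation, Case
}


def training_pairs(cases, model):
    """Return (case, cited item) pairs for the 'CL', 'CC' or 'CLC' model."""
    pairs = []
    for case_id, cited in cases.items():
        for field in MODEL_FIELDS[model]:
            for target in cited[field]:
                pairs.append((case_id, target))
    return pairs


cl_pairs = training_pairs(cases, "CL")
cc_pairs = training_pairs(cases, "CC")
clc_pairs = training_pairs(cases, "CLC")
```

Under this reading, the CLC pairs are simply the union of the CL and CC pairs, so a single network sees both kinds of citation for each case.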
  • 58.
    3. Learning of Text Relationship | Evaluation of Law2Vec
  • 59.
    3. Learning of Text Relationship | Evaluation of CL Model (Figure labels: Cited legislations, Case)
  • 60.
    3. Learning of Text Relationship | Evaluation of Law2Vec: W_1
  • 61.
    3. Learning of Text Relationship | Evaluation of Law2Vec: W_2
  • 62.
    3. Learning of Text Relationship | Evaluation of CC Model (Figure labels: Cited cases, Case)
  • 63.
    3. Learning of Text Relationship | Evaluation of CC Model
  • 64.
    3. Learning of Text Relationship | Evaluation of CLC Model (Figure labels: Cited legislations, Cited cases, Case)
  • 65.
    3. Learning of Text Relationship | Evaluation of CLC Model
  • 66.
    3. Learning of Text Relationship | Result of Law2Vec
  • 67.
    3. Learning of Text Relationship | Result of Law2Vec
  • 68.
    3. Learning of Text Relationship | Expansion of Data set < Lawyer Oh’s Answer Sheet >
  • 69.
    3. Learning of Text Relationship | Law2Vec for Sample Data (Figure panels: CL Model, CC Model and CLC Model, each at iteration 10000 and iteration 60000)
  • 70.
    3. Learning of Text Relationship | Link Prediction
  • 71.
    Conclusion
    1. Main Contribution
    - Chapter 2. Text Summarization: Korean Language Feature V2 (JKB, JX)
    - Chapter 3. Text Clustering: Syllable Vector
    - Chapter 4. Text Relational Learning: Heterogeneous Word2Vec (Law2Vec)
    2. Advantage
    - Chapter 2. Text Summarization: summarizes using linguistic features of Korean.
    - Chapter 3. Text Clustering: reduces dimensionality with little information loss by using the syllable vector, which makes computation efficient for large-scale document sets.
    - Chapter 4. Text Relational Learning: learns from heterogeneous data using the relationships between items, without text (word) data.
  • 72.
    Conclusion
    3. Interest to reader
    - Chapter 2. Text Summarization: apply fuzzy concepts to text mining while considering language features => define your idea and apply it to a system easily.
    - Chapter 3. Text Clustering: the Korean language is more efficient for large-scale document sets => Korean is well suited to compressing text data.
    - Chapter 4. Text Relational Learning: design the NN system to fit the structure of your data => metadata is good enough material for learning the relationships between data items.