[F2] Introduction to Machine Learning for Natural Language Processing

Kangwon National University, Department of Computer Science
Changki Lee

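  Introduction to natural language processing
◦  Korean automatic word spacing, Korean named entity recognition
  Introduction to machine learning for natural language processing
◦  ME, CRF, SVM, Structural SVM
  Experimental results: Korean automatic word spacing
  Experimental results: Korean named entity recognition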

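  Automatic word spacing (word segmentation)
◦  The problem of deciding the boundaries between words (eojeols, in the case of Korean)
  An essential step for Chinese and Japanese, and heavily studied for those languages
  Mostly statistical: HMM, SVM, CRF
◦  Korean is written with spaces between eojeols
  Spacing errors are common in messenger chats, SMS, blogs, tweets, comments, etc.
  Also needed as a post-processing module for the output of character recognizers and speech recognizers
◦  The input is generally assumed to be a Korean sentence with all spacing removed
  A correctly spaced corpus is used as training data
◦  Prior work on Korean word spacing
  Mostly statistical: HMM, CRF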

  This talk
◦  Korean automatic word spacing using Structural SVM (Pegasos algorithm)

  x = <아, 버, 지, 가, 방, 에, 들, 어, 가, 셨, 다, ‘.’>
  y = < B, I, I, I, B, I, B, I, I, I, I, I>
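The B/I encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the system from the talk; the helper names `to_tags` and `apply_tags` are hypothetical. B marks the first syllable of an eojeol and I marks every other syllable.

```python
def to_tags(spaced):
    """Turn a correctly spaced sentence into (syllables, B/I tags)."""
    chars, tags = [], []
    for word in spaced.split():
        for k, ch in enumerate(word):
            chars.append(ch)
            tags.append("B" if k == 0 else "I")
    return chars, tags


def apply_tags(chars, tags):
    """Rebuild the spaced sentence: a space goes before every B except the first."""
    out = []
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        if tag == "B" and i > 0:
            out.append(" ")
        out.append(ch)
    return "".join(out)


x, y = to_tags("아버지가 방에 들어가셨다.")
print(x)                  # the 12 syllables of the example
print(y)                  # B I I I B I B I I I I I
print(apply_tags(x, y))   # the spaced sentence again
```

Training data is produced with `to_tags` from a well-spaced corpus; at prediction time the tagger outputs y for an unspaced x and `apply_tags` restores the spaces.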

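  Named entity recognition
◦  Extract named entities such as person, location, and organization names from documents and determine their types
◦  Research began in earnest at the Message Understanding Conference (MUC)
◦  Advanced further at CoNLL 2002 and 2003
  The top-scoring system, from IBM, combined several machine learning methods by voting; about F1=89 for English
◦  Recent work mainly uses statistical machine learning methods
  HMM, MEMM, SVMs, CRFs
  This talk
◦  Introduces a named entity recognition system based on Structural SVMs and the Pegasos algorithm, and compares it experimentally with an existing CRFs-based named entity recognition system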

  x = <한나라당/nn, 조해진/nn, 대변인/nc, …>
  y = <B-Org, B-Per, O, …>
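To make the tag sequence concrete, here is a minimal sketch that turns per-morpheme tags back into entities. It assumes the usual BIO convention (B- opens an entity, I- continues it, O is outside); the slide only shows B- and O tags, so the I- handling and the function name `decode_entities` are assumptions.

```python
def decode_entities(tokens, tags):
    """Collect (entity_text, type) pairs from B-/I-/O tags."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the entity that was open
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:  # O, or an I- tag that does not continue the open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["한나라당", "조해진", "대변인", "은"]
tags = ["B-Org", "B-Per", "O", "O"]
print(decode_entities(tokens, tags))
```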

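Input x and label sequence y (a linear chain over the labels):

y   B-Org        B-Per      O          O       …
x   한나라당/nn  조해진/nn  대변인/nc  은/jc   …

Joint feature vector:

$$
\Psi(x, y) =
\begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 1 \\ 1 \\ \vdots \\ 1 \\ \vdots \end{pmatrix}
\begin{array}{l}
\text{B-Org} \to \text{nn} \\
\text{B-Org} \to \text{nc} \\
\text{B-Org} \to \text{jc} \\
\vdots \\
\text{B-Per} \to \text{nn} \\
\text{O} \to \text{nc} \\
\vdots \\
\text{B-Org} \to \text{한나라당} \\
\vdots
\end{array}
$$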
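The feature vector Ψ(x, y) is just a count of (label, observation) pairs, so it can be held as a sparse map. The sketch below is an illustration under that reading; the key format and the name `joint_feature_map` are hypothetical, and only the POS and word features shown on the slide are included.

```python
from collections import Counter


def joint_feature_map(tokens, tags):
    """Sparse Psi(x, y): how often each (label, observation) pair occurs."""
    psi = Counter()
    for token, tag in zip(tokens, tags):
        word, pos = token.rsplit("/", 1)
        psi[(tag, "pos=" + pos)] += 1    # e.g. B-Org -> nn
        psi[(tag, "word=" + word)] += 1  # e.g. B-Org -> 한나라당
    return psi


tokens = ["한나라당/nn", "조해진/nn", "대변인/nc", "은/jc"]
tags = ["B-Org", "B-Per", "O", "O"]
psi = joint_feature_map(tokens, tags)

print(psi[("B-Org", "pos=nn")])   # pair occurs in (x, y)
print(psi[("B-Org", "pos=nc")])   # pair does not occur, so the entry is zero
```

Entries that never fire are simply absent from the map, which is what keeps a vector with millions of possible dimensions cheap to store.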

  It can be proved that the ME solution p* must have the form:

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\left[ \sum_{i=1}^{k} \lambda_i f_i(x, y) \right],
\qquad
Z(x) = \sum_{y} \exp\left[ \sum_{i=1}^{k} \lambda_i f_i(x, y) \right]
$$

  So our task is to estimate the parameters $\lambda_i$ in p* which maximize H(p)
◦  When the problem is complex, we need a way to automatically derive $\lambda_i$, given a set of constraints
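As a small worked instance of the exponential form, the sketch below evaluates p(y | x) for one input. The two feature functions and their weights are made up for illustration; in practice the weights come from training.

```python
import math


def maxent_prob(weights, features, x, labels):
    """p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)."""
    scores = {y: sum(w * f(x, y) for w, f in zip(weights, features)) for y in labels}
    z = sum(math.exp(s) for s in scores.values())  # Z(x): sum over all labels
    return {y: math.exp(s) / z for y, s in scores.items()}


# Hypothetical binary features f_i(x, y) and weights lambda_i.
features = [
    lambda x, y: 1.0 if x.endswith("당") and y == "Org" else 0.0,
    lambda x, y: 1.0 if y == "O" else 0.0,
]
weights = [2.0, 0.5]

p = maxent_prob(weights, features, "한나라당", ["Org", "Per", "O"])
print(p)
```

The unnormalized scores are 2.0 for Org, 0.0 for Per and 0.5 for O, so Org receives the largest probability and the three values sum to one.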


Lagrange multiplier

Solve:

  Probabilistic conditional models generalizing MEMMs.
  Whole-sequence rather than per-state normalization.
  Convex likelihood function.

  CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

$$
p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \right)
$$
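For a linear chain, Z(x) sums over every label sequence, but the forward algorithm computes it without enumerating them. The sketch below assumes a two-label chain; `emit` stands for the weighted vertex features and `trans` for the weighted edge features, and all numbers are hypothetical. A brute-force sum is included only to check the result.

```python
import itertools
import math

LABELS = ["B", "I"]


def score(emit, trans, y):
    """Unnormalized log-score: vertex terms plus edge terms."""
    s = sum(emit[t][label] for t, label in enumerate(y))
    s += sum(trans[(a, b)] for a, b in zip(y, y[1:]))
    return s


def log_partition(emit, trans):
    """log Z(x) by the forward algorithm, O(T * |LABELS|^2)."""
    alpha = {label: emit[0][label] for label in LABELS}
    for t in range(1, len(emit)):
        alpha = {
            b: emit[t][b]
            + math.log(sum(math.exp(alpha[a] + trans[(a, b)]) for a in LABELS))
            for b in LABELS
        }
    return math.log(sum(math.exp(v) for v in alpha.values()))


emit = [{"B": 1.0, "I": -1.0}, {"B": 0.2, "I": 0.4}, {"B": -0.5, "I": 0.8}]
trans = {("B", "B"): -0.3, ("B", "I"): 0.6, ("I", "B"): 0.1, ("I", "I"): 0.2}

log_z = log_partition(emit, trans)
brute = math.log(sum(math.exp(score(emit, trans, y))
                     for y in itertools.product(LABELS, repeat=len(emit))))
prob = math.exp(score(emit, trans, ("B", "I", "I")) - log_z)
print(log_z, brute, prob)
```

Normalizing once per sequence, rather than once per state as in an MEMM, is what the whole-sequence normalization bullet refers to.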

  Part-Of-Speech tagging experiments


  Mallet
◦  http://mallet.cs.umass.edu/index.php/Main_Page
◦  Java, GIS, L-BFGS, Feature induction, …

  FlexCRFs
◦  http://www.jaist.ac.jp/~hieuxuan/flexcrfs/flexcrfs.html
◦  C++, L-BFGS, Constrained CRF

  CRF++: Yet Another CRF toolkit
◦  http://chasen.org/~taku/software/CRF++/
◦  C++, L-BFGS, n-best output

  Our CRF toolkit
◦  ETRI Knowledge Mining Research Team
◦  C++, L-BFGS, n-best output
◦  Structural SVM
  Supports training algorithms such as the bundle method, Pegasos, and FSMO
  Supports domain adaptation


Lagrange multiplier (KKT condition)

Tool: SVM-light, SVM-Perf, LIBSVM (SMO algorithm)

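Binary SVM

Primal:

$$
\min_{\mathbf{w}, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad \forall i,\ \xi_i \ge 0, \qquad \forall i,\ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i
$$

Dual:

$$
\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j
\quad \text{s.t.} \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \qquad \forall i,\ 0 \le \alpha_i \le \frac{C}{n}
$$

$$
\mathbf{w} = \sum_{i=1}^{n} y_i \alpha_i \mathbf{x}_i
$$

Structural SVM

Primal:

$$
\min_{\mathbf{w}, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad \forall i,\ \xi_i \ge 0, \qquad \forall i,\ \forall \mathbf{y} \in Y \setminus \mathbf{y}_i :\ \mathbf{w}^T \delta\Psi_i(\mathbf{x}_i, \mathbf{y}) \ge L(\mathbf{y}_i, \mathbf{y}) - \xi_i
$$

$$
\delta\Psi_i(\mathbf{x}_i, \mathbf{y}) = \Psi(\mathbf{x}_i, \mathbf{y}_i) - \Psi(\mathbf{x}_i, \mathbf{y})
$$

Here $\Psi(\mathbf{x}_i, \mathbf{y}_i)$ is the feature vector of the correct label sequence and $\Psi(\mathbf{x}_i, \mathbf{y})$ that of an incorrect label sequence.

Dual:

$$
\max_{\alpha} \ \sum_{i,\, \mathbf{y} \ne \mathbf{y}_i} \alpha_{i\mathbf{y}} L(\mathbf{y}_i, \mathbf{y})
- \frac{1}{2} \sum_{i,\, \mathbf{y} \ne \mathbf{y}_i} \ \sum_{j,\, \bar{\mathbf{y}} \ne \mathbf{y}_j} \alpha_{i\mathbf{y}} \alpha_{j\bar{\mathbf{y}}} \, \delta\Psi_i(\mathbf{x}_i, \mathbf{y}) \cdot \delta\Psi_j(\mathbf{x}_j, \bar{\mathbf{y}})
$$

$$
\text{s.t.} \quad \forall i,\ 0 \le \sum_{\mathbf{y} \ne \mathbf{y}_i} \alpha_{i\mathbf{y}} \le \frac{C}{n}, \qquad \forall i,\ \forall \mathbf{y} \ne \mathbf{y}_i,\ \alpha_{i\mathbf{y}} \ge 0
$$

$$
\mathbf{w} = \sum_{i,\, \mathbf{y} \ne \mathbf{y}_i} \alpha_{i\mathbf{y}} \, \delta\Psi_i(\mathbf{x}_i, \mathbf{y})
$$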
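The structural SVM constraint says that the correct sequence must outscore every incorrect one by at least the loss L, up to the slack ξ_i. The sketch below computes the smallest feasible ξ_i for one toy example by enumerating all label sequences. The feature map, the Hamming loss and the weights are hypothetical choices made for illustration.

```python
import itertools

LABELS = ["B", "I"]


def psi(x, y):
    """Toy joint feature map: counts of (character, label) pairs."""
    feats = {}
    for ch, label in zip(x, y):
        feats[(ch, label)] = feats.get((ch, label), 0) + 1
    return feats


def dot(w, feats):
    return sum(w.get(k, 0.0) * v for k, v in feats.items())


def hamming(y_true, y_pred):
    return sum(a != b for a, b in zip(y_true, y_pred))


def slack(w, x, y_true):
    """xi_i = max(0, max over y != y_i of [L(y_i, y) - w . dPsi_i(x_i, y)])."""
    worst = 0.0
    for y in itertools.product(LABELS, repeat=len(x)):
        if list(y) == list(y_true):
            continue
        margin = dot(w, psi(x, y_true)) - dot(w, psi(x, y))  # w . dPsi_i
        worst = max(worst, hamming(y_true, y) - margin)
    return worst


x, y_true = ["방", "에"], ["B", "I"]
w_zero = {}
w_good = {("방", "B"): 1.0, ("에", "I"): 1.0}
print(slack(w_zero, x, y_true))  # every constraint is violated by its full loss
print(slack(w_good, x, y_true))  # every margin is at least the loss
```

Enumeration is only feasible for toy inputs; real training replaces the inner maximum with a dynamic program over the chain.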

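Ψ(x, y)

Parsing example

Named entity recognition example

y   B-Org        B-Per      O          O       …
x   한나라당/nn  조해진/nn  대변인/nc  은/jc   …

$$
\Psi(x, y) =
\begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 1 \\ 1 \\ \vdots \\ 1 \\ \vdots \end{pmatrix}
\begin{array}{l}
\text{B-Org} \to \text{nn} \\
\text{B-Org} \to \text{nc} \\
\text{B-Org} \to \text{jc} \\
\vdots \\
\text{B-Per} \to \text{nn} \\
\text{O} \to \text{nc} \\
\vdots \\
\text{B-Org} \to \text{한나라당} \\
\vdots
\end{array}
$$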

ETRI journal

•  FSMO uses the fact that the formulation of structured SVM has no bias.
•  FSMO breaks down the QP problems of structural SVM into a series of smallest QP problems, each involving only one variable.

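n-slack structural SVMs

$$
\min_{\mathbf{w}, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad \forall i,\ \xi_i \ge 0, \qquad \forall i,\ \forall \mathbf{y} \in Y \setminus \mathbf{y}_i :\ \mathbf{w}^T \delta\Psi_i(\mathbf{x}_i, \mathbf{y}) \ge L(\mathbf{y}_i, \mathbf{y}) - \xi_i
$$

1-slack structural SVMs

$$
\min_{\mathbf{w}, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + C\xi
\quad \text{s.t.} \quad \xi \ge 0, \qquad \forall (\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_n) \in Y^n :\ \frac{1}{n}\mathbf{w}^T \sum_{i=1}^{n} \delta\Psi_i(\mathbf{x}_i, \hat{\mathbf{y}}_i) \ge \frac{1}{n}\sum_{i=1}^{n} L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \xi
$$

1-Slack Cutting Plane Algorithm

 1: Input: $(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)$, $C$, $\epsilon$
 2: $S \leftarrow \emptyset$
 3: repeat
 4:   $(\mathbf{w}, \xi) \leftarrow \arg\min_{\mathbf{w},\, \xi \ge 0} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\xi$
      s.t. $\forall (\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_n) \in S :\ \frac{1}{n}\mathbf{w}^T \sum_{i=1}^{n} \delta\Psi_i(\mathbf{x}_i, \hat{\mathbf{y}}_i) \ge \frac{1}{n}\sum_{i=1}^{n} L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \xi$
 5:   for $i = 1, \ldots, n$ do
 6:     $\hat{\mathbf{y}}_i \leftarrow \arg\max_{\mathbf{y} \in Y} \{ L(\mathbf{y}_i, \mathbf{y}) + \mathbf{w}^T \Psi(\mathbf{x}_i, \mathbf{y}) \}$
 7:   end for
 8:   $S \leftarrow S \cup \{(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_n)\}$
 9: until $\frac{1}{n}\sum_{i=1}^{n} L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \frac{1}{n}\mathbf{w}^T \sum_{i=1}^{n} \delta\Psi_i(\mathbf{x}_i, \hat{\mathbf{y}}_i) \le \xi + \epsilon$
10: return $(\mathbf{w}, \xi)$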
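Step 6 of the algorithm, finding the most violated labeling, is the part that depends on the output structure. For a linear chain with Hamming loss it reduces to Viterbi decoding with one extra point added to every wrong label. The sketch below assumes that setting; `emit` and `trans` stand for the position and transition scores induced by w, and the function name is hypothetical.

```python
def loss_augmented_viterbi(emit, trans, y_true, labels):
    """argmax over y of [Hamming(y_true, y) + score(y)] for a linear chain."""
    # Hamming loss decomposes per position: add 1 to each wrong label's score.
    aug = [{l: emit[t][l] + (l != y_true[t]) for l in labels}
           for t in range(len(emit))]
    best = {l: aug[0][l] for l in labels}
    back = []
    for t in range(1, len(aug)):
        new_best, pointers = {}, {}
        for b in labels:
            a_star = max(labels, key=lambda a: best[a] + trans[(a, b)])
            new_best[b] = best[a_star] + trans[(a_star, b)] + aug[t][b]
            pointers[b] = a_star
        best = new_best
        back.append(pointers)
    last = max(labels, key=lambda l: best[l])
    path = [last]
    for pointers in reversed(back):  # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return path[::-1], best[last]


labels = ["B", "I"]
y_true = ["B", "I", "I"]
emit = [{l: 0.0 for l in labels} for _ in y_true]      # scores under w = 0
trans = {(a, b): 0.0 for a in labels for b in labels}
path, value = loss_augmented_viterbi(emit, trans, y_true, labels)
print(path, value)
```

With w = 0 all model scores vanish, so the most violated labeling is simply the one that is wrong at every position.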

  Stochastic Gradient Descent (SGD)
◦  Optimization algorithm for unconstrained optimization

  Pegasos
◦  Primal Estimated sub-GrAdient SOlver for SVM
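A minimal sketch of the Pegasos update for a plain binary linear SVM without bias, to show the shape of the algorithm. This is not the structural variant used in the talk, the optional projection step is omitted, and the toy data, `lam`, and iteration count are arbitrary assumptions.

```python
import random


def pegasos(data, lam=0.01, iterations=1000, seed=0):
    """Pegasos: stochastic sub-gradient descent on the primal SVM objective."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for t in range(1, iterations + 1):
        x, y = rng.choice(data)                    # one random training example
        eta = 1.0 / (lam * t)                      # step size 1 / (lambda * t)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [(1.0 - eta * lam) * wi for wi in w]   # shrink: L2 regularizer term
        if margin < 1.0:                           # hinge loss active: add y * x
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


data = [
    ((2.0, 2.0), 1), ((3.0, 3.0), 1), ((2.0, 3.0), 1),
    ((-2.0, -2.0), -1), ((-3.0, -3.0), -1), ((-2.0, -3.0), -1),
]
w = pegasos(data)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in data]
print(w, predictions)
```

Each step touches a single example, so the cost per update does not grow with the size of the training set; the structural version replaces the hinge test with a loss-augmented inference call.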


Ψ (x, y )
  x = <아, 버, 지, 가, 방, 에, 들, 어, 가, 셨, 다, ‘.’>
  y = < B, I, I, I, B, I, B, I, I, I, I, I>

N-gram character feature
{x_t, x_{t-1}, x_{t+1}, x_{t-2}x_{t-1}, x_{t-1}x_t, x_t x_{t+1}, x_{t+1}x_{t+2}, x_{t-3}x_{t-2}x_{t-1}, x_{t-2}x_{t-1}x_t, x_{t-1}x_t x_{t+1}, x_t x_{t+1}x_{t+2}, x_{t+1}x_{t+2}x_{t+3}} × y_t

List lookup feature (list of nouns)
{lookup(x_{t-3}x_{t-2}…x_{t+3}), lookup(x_{t-3}x_{t-2}…x_{t+2}), lookup(x_{t-3}x_{t-2}…x_{t+1}), lookup(x_{t-3}x_{t-2}…x_t), lookup(x_{t-3}x_{t-2}x_{t-1}), lookup(x_{t-2}x_{t-1}…x_{t+3}), lookup(x_{t-2}x_{t-1}…x_{t+2}), lookup(x_{t-2}x_{t-1}…x_{t+1}), lookup(x_{t-2}x_{t-1}x_t), lookup(x_{t-2}x_{t-1}), lookup(x_{t-1}x_t…x_{t+3}), lookup(x_{t-1}x_t…x_{t+2}), lookup(x_{t-1}x_t x_{t+1}), lookup(x_{t-1}x_t), lookup(x_t x_{t+1}…x_{t+3}), lookup(x_t x_{t+1}x_{t+2}), lookup(x_t x_{t+1}), lookup(x_{t+1}x_{t+2}x_{t+3}), lookup(x_{t+1}x_{t+2})} × y_t

Character level regular expression feature
{normalize(x_{t-3}), normalize(x_{t-3}x_{t-2}), normalize(x_{t-3}x_{t-2}x_{t-1}), normalize(x_{t-2}), normalize(x_{t-2}x_{t-1}), normalize(x_{t-2}x_{t-1}x_t), normalize(x_{t-1}), normalize(x_{t-1}x_t), normalize(x_{t-1}x_t x_{t+1}), normalize(x_t), normalize(x_t x_{t+1}), normalize(x_t x_{t+1}x_{t+2}), normalize(x_{t+1}), normalize(x_{t+1}x_{t+2}), normalize(x_{t+1}x_{t+2}x_{t+3}), normalize(x_{t+2}), normalize(x_{t+2}x_{t+3}), normalize(x_{t+3})} × y_t

Transition feature
y_{t-1} × y_t
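The twelve n-gram character templates can be generated from a list of offset windows. The sketch below is one possible encoding; the padding symbol "_" for positions outside the sentence and the string format of each feature are assumptions. In the model each extracted string is then paired with the label y_t.

```python
def ngram_features(chars, t):
    """Character n-gram features around position t (12 templates)."""
    def x(offset):
        i = t + offset
        return chars[i] if 0 <= i < len(chars) else "_"  # assumed padding symbol

    windows = [
        (0,), (-1,), (1,),                                      # unigrams
        (-2, -1), (-1, 0), (0, 1), (1, 2),                      # bigrams
        (-3, -2, -1), (-2, -1, 0), (-1, 0, 1), (0, 1, 2), (1, 2, 3),  # trigrams
    ]
    return ["%s=%s" % (",".join(map(str, w)), "".join(x(o) for o in w))
            for w in windows]


chars = list("아버지가방에들어가셨다.")
feats = ngram_features(chars, 4)   # position of the syllable 방
print(feats)
```

Prefixing each value with its offsets keeps the templates apart, so the bigram 가방 at offsets (-1, 0) is a different feature from the same bigram at (0, 1).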

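  Experimental data
  Training: raw text of the Sejong corpus (26 million eojeols)
  Evaluation: raw text of the ETRI POS-tagged corpus (290,000 eojeols)
  50% of the training data used in some runs
  Reason: CRF and Structural SVM run out of memory on the full set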

Algorithm                       Syllable-level accuracy   Eojeol-level accuracy
HMM [1]                         98.44                     93.46
CRF [2]*                        98.84*                    95.99*
CRF (50% training)              98.47                     93.97
S-SVM (50% training)            98.51                     94.11
Pegasos-struct (50% training)   98.73                     94.13
Pegasos-struct                  99.01                     95.47
Pegasos-struct + 2nd-order      99.09                     96.00

  * Evaluated on a different test set
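The gap between the two accuracy columns follows from how they are counted. The sketch below shows one plausible definition: syllable-level accuracy is the share of correct B/I tags, and eojeol-level accuracy is the share of gold eojeols whose start and end are both reproduced. The exact scoring script behind the table is not given in the slides, so treat both definitions as assumptions.

```python
def word_spans(tags):
    """Set of (start, end) eojeol spans implied by a B/I tag sequence."""
    spans, start = set(), 0
    for i in range(1, len(tags)):
        if tags[i] == "B":
            spans.add((start, i))
            start = i
    spans.add((start, len(tags)))
    return spans


def accuracies(gold, pred):
    syllable = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    gold_spans = word_spans(gold)
    eojeol = len(gold_spans & word_spans(pred)) / len(gold_spans)
    return syllable, eojeol


gold = list("BIIIBIBIIIII")   # 아버지가 방에 들어가셨다.
pred = list("BIIBBIBIIIII")   # one wrong tag: an extra space before 가
syllable_acc, eojeol_acc = accuracies(gold, pred)
print(syllable_acc, eojeol_acc)
```

A single wrong tag costs one syllable out of twelve but breaks one eojeol out of three, which is why eojeol-level figures sit several points below syllable-level ones.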

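  Features (shared by all machine learning algorithms used in the experiments)
◦  Lexical features: the surface form of the morphemes at positions (-2,-1,0,1,2)
◦  Suffix features: the suffix of the morpheme surface forms at positions (-2,-1,0,1,2)
◦  Morpheme tag features: the POS tag of the morphemes at positions (-2,-1,0,1,2)
◦  Morpheme tag + length features: the combination of the tag and the length of the morpheme
◦  Position of the morpheme within the eojeol: whether the morpheme is at the start, middle, or end of the eojeol
◦  Named entity dictionary features: whether the morpheme appears in the named entity dictionary
◦  Named entity dictionary feature + morpheme length feature
◦  15 regular expressions: [A-Z]*, [0-9]*, [0-9][0-9], [0-9][0-9][0-9][0-9], [A-Za-z0-9]*, …
◦  Word cluster features based on N-gram classes
◦  Lexical semantic features from a lexical network
◦  Prediction history features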
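A minimal sketch of the regular expression features, using only the five patterns listed above (the other ten are not given in the slides). Whether the original system matches the whole morpheme or a substring is not stated; full matching via `re.fullmatch` is an assumption, as is the function name.

```python
import re

# The five patterns shown on the slide; the remaining ten are not listed.
PATTERNS = ["[A-Z]*", "[0-9]*", "[0-9][0-9]", "[0-9][0-9][0-9][0-9]", "[A-Za-z0-9]*"]


def regex_features(morpheme):
    """Return the patterns that match the entire morpheme."""
    return [p for p in PATTERNS if re.fullmatch(p, morpheme)]


for morpheme in ["2011", "KBS", "대변인"]:
    print(morpheme, regex_features(morpheme))
```

Each matching pattern fires as one binary feature, which lets the model generalize over, say, all four-digit tokens without seeing each year in training.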

Machine learning algorithm   Training time (s)   Accuracy         F1
CRFs (baseline)              16738               96.78            84.99
1-slack structural SVMs      11239               96.92 (+0.14)    85.14 (+0.15)
modified Pegasos             649                 96.94 (+0.16)    85.43 (+0.44)

  TV domain (3,000 documents): 2,900 for training, 100 for testing
  Training time: modified Pegasos needs 4% of the CRFs training time
  Paired t-test (significance level 0.01, confidence level 99%)
  Modified Pegasos, structural SVMs > CRFs
  Modified Pegasos ~= structural SVMs (p-value 0.69)
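For reference, the paired t statistic compares two systems on the same test items by looking only at the per-item differences. The sketch below uses made-up per-document scores; the slides do not say which unit the test was paired over, so the pairing by document is an assumption.

```python
import math
import statistics


def paired_t(a, b):
    """t statistic of the paired differences a_i - b_i (n - 1 degrees of freedom)."""
    d = [ai - bi for ai, bi in zip(a, b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


# Hypothetical per-document F1 scores of two systems on the same documents.
system_a = [86.0, 84.5, 88.0, 85.5, 87.0]
system_b = [85.0, 84.0, 86.5, 85.0, 86.5]
t_value = paired_t(system_a, system_b)
print(t_value)
```

The resulting t is compared against the critical value of the t distribution with n - 1 degrees of freedom at the chosen significance level; a large p-value such as 0.69 means the two systems cannot be told apart.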

Machine learning algorithm   Training time (s)   Accuracy         F1
CRFs (baseline)              14362               95.58            86.64
1-slack structural SVMs      5991                95.82 (+0.24)    86.86 (+0.22)
modified Pegasos             610                 95.81 (+0.23)    86.79 (+0.15)

  Sports domain (3,500 documents): 3,400 for training, 100 for testing
  Training time: modified Pegasos needs 4% of the CRFs training time
  Paired t-test (significance level 0.001, confidence level 99.9%)
  Modified Pegasos, structural SVMs > CRFs
  Modified Pegasos ~= structural SVMs (p-value 0.72)
