SlideShare a Scribd company logo
1 of 4
Download to read offline
์ œ 29 ํšŒ ํ•œ๊ตญ์ •๋ณด์ฒ˜๋ฆฌํ•™ํšŒ ์ถ˜๊ณ„ํ•™์ˆ ๋ฐœํ‘œ๋Œ€ํšŒ ๋…ผ๋ฌธ์ง‘ ์ œ 15 ๊ถŒ ์ œ 1 ํ˜ธ (2008. 5)
1
๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ์ด์šฉํ•œ ๋ฌธ์žฅ๊ฒฝ๊ณ„์ธ์‹
๋ฐ•์ˆ˜ํ˜*, ์ž„ํ•ด์ฐฝ**
*๊ณ ๋ ค๋Œ€ํ•™๊ต ์ปดํ“จํ„ฐ ์ •๋ณดํ†ต์‹ ๋Œ€ํ•™์›
**๊ณ ๋ ค๋Œ€ํ•™๊ต ์ปดํ“จํ„ฐํ•™๊ณผ
e-mail : *park.suhyuk@gmail.com, **rim@nlp.korea.ac.kr
Sentence Boundary Detection Using Machine Learning Techniques
Su-Hyuk Park*, Hae-Chang Rim**
* Graduate School of Computer and Information Technology, Korea University
** Dept. of Computer Science and Engineering, Korea University
์š” ์•ฝ
๋ณธ ๋…ผ๋ฌธ์€ ์–ธ์–ด์˜ ํ†ต๊ณ„์  ํŠน์ง•์„ ์ด์šฉํ•˜์—ฌ ๋ฒ”์šฉ์˜ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹๊ธฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•
์€ ๋Œ€๋Ÿ‰์˜ ์ฝ”ํผ์Šค ๋‚ด์—์„œ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋Š” ๋ฌธ์žฅ ๊ฒฝ๊ณ„๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์Œ์ ˆ ๋ฐ ์–ด์ ˆ ๋“ฑ์˜ ์ž์งˆ์„ ์ด์šฉํ•˜์—ฌ
ํ†ต๊ณ„์  ํŠน์ง•์„ ์ถ”์ถœํ•˜๊ณ  ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์žฅ๊ฒฝ๊ณ„๋ฅผ ์ธ์‹ํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ๋˜ํ•œ ํŠน
์ • ์–ธ์–ด๋‚˜ ๋„๋ฉ”์ธ์— ์ œํ•œ์ ์ด์ง€ ์•Š๊ณ  ๋ฒ”์šฉ์ ์ธ ์ž์งˆ๋งŒ์„ ์‚ฌ์šฉํ•˜๋ ค๊ณ  ๋…ธ๋ ฅํ•˜์˜€๋‹ค.
์–ธ์–ด์˜ ํŠน์„ฑ์ƒ ๋ฌธ์žฅ์˜ ๊ตฌ๋ถ„์ด ์• ๋งคํ•œ ๊ฒฝ์šฐ ๋˜๋Š” ์ž˜๋ชป ์‚ฌ์šฉ ๋œ ๊ตฌ๋‘์  ๋“ฑ์˜ ๊ฒฝ์šฐ์—๋„ ์ ์šฉ ๊ฐ€๋Šฅ
ํ•˜๋„๋ก ๋‹ค์–‘ํ•œ ์ž์งˆ์„ ์‚ฌ์šฉํ•˜์—ฌ ์‹คํ—˜ํ•˜์˜€์œผ๋ฉฐ, ํ•œ๊ตญ์–ด์™€ ์˜๋ฌธ ์ฝ”ํผ์Šค์— ๋Œ€ํ•ด์„œ ๋™์ผํ•œ ์ž์งˆ์„ ์ ์šฉ
ํ•˜์—ฌ ์‹คํ—˜ํ•˜์—ฌ ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ์ž์งˆ๋“ค์ด ํ•œ๊ตญ์–ด ๋ฐ ๋‹ค๋ฅธ ์–ธ์–ด๊ถŒ์˜ ์–ธ์–ด์—๋„ ์ ์šฉ๋  ์ˆ˜ ์žˆ๋Š” ๋ฒ”
์šฉ์ ์ธ ์ž์งˆ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.
ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์„ ์œ„ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๋ฐ ์‹คํ—˜์„ ์œ„ํ•ด์„œ ์„ธ์ข…๊ณ„ํš ์ฝ”ํผ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์„ฑ
๋Šฅ์ฒ™๋„๋กœ๋Š” ์ •ํ™•๋ฅ ๊ณผ ์žฌํ˜„์œจ์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์‹คํ—˜๊ฒฐ๊ณผ ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ 99%์˜ ์ •ํ™•๋ฅ ๊ณผ 99.2%์˜
์žฌํ˜„์œจ์„ ๋ณด์˜€๋‹ค. ์˜๋ฌธ์˜ ๊ฒฝ์šฐ๋Š” Wall Street Journal ์ฝ”ํผ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ๋™์ผํ•œ ์ž์งˆ์„ ์ ์šฉํ•˜์—ฌ
์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ 98.9%์˜ ์ •ํ™•๋ฅ ๊ณผ 94.6%์˜ ์žฌํ˜„์œจ์„ ๋ณด์˜€๋‹ค.
1. ์„œ๋ก 
โ€˜๋ฌธ์žฅโ€™์˜ ์‚ฌ์ „์ ์ธ ์˜๋ฏธ๋Š” '์˜์‚ฌ๋ฅผ ์ „๋‹ฌํ•˜๋Š” ์ตœ์†Œ์˜
๋‹จ์œ„'๋กœ ์ •์˜๋˜์–ด ์žˆ์œผ๋ฉฐ, ์ „ํ†ต ๋ฌธ๋ฒ•์—์„œ๋Š” '๋น„๊ต์  ์™„
์ „ํ•˜๊ณ  ๋…๋ฆฝ๋œ ์˜์‚ฌ์ „๋‹ฌ ๋‹จ์œ„๋‹ค'๋ผ๊ณ  ์ •์˜ํ•˜๊ณ  ์žˆ์œผ
๋ฉฐ, ์ด๋Ÿฌํ•œ ๋ฌธ์žฅ์˜ ๋‹จ์œ„๊ฐ€ ๋Œ€๋ถ€๋ถ„์˜ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋„
๊ตฌ, ํ’ˆ์‚ฌ๋ถ€์ฐฉ๊ธฐ ๋“ฑ์˜ ๊ธฐ๋ณธ๋‹จ์œ„๊ฐ€ ๋˜๊ณ  ์žˆ์œผ๋ฉฐ, ์Œ์„ฑ
์ธ์‹๋œ ๋ฌธ์žฅ ๋˜๋Š” OCR ์ฒ˜๋ฆฌ๋œ ๋ฌธ์„œ ๋“ฑ์˜ ๋ฌธ์žฅ๊ฒฝ๊ณ„๊ฐ€
๋ชจํ˜ธํ•œ ๊ฒฝ์šฐ์— ๋Œ€ํ•ด์„œ๋„ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์œผ๋กœ๋„ ๋ฌธ์žฅ๊ฒฝ๊ณ„
์˜ ์ธ์‹์ž‘์—…์ด ๋ฐ˜๋“œ์‹œ ์š”๊ตฌ๋œ๋‹ค.
ํ•˜์ง€๋งŒ ์‹ค๋ก€์—์„œ๋Š” ๋ฌธ์žฅ๊ฒฝ๊ณ„์˜ ๊ตฌ๋ถ„์ด ๋ช…ํ™•ํ•˜์ง€ ๋ชป
ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์ด ๋ณด์ด๋ฉฐ, ์‚ฌ๋žŒ์ด ๋ณด์•„์„œ๋„ ์ œ๋Œ€๋กœ ๊ตฌ
๋ถ„๋˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ๋„ ๋งŽ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ,
๋Š๋‚Œํ‘œ ๋“ฑ์˜ ๊ธฐํ˜ธ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ฌธ์žฅ๊ฒฝ๊ณ„๊ฐ€ ๊ตฌ๋ถ„๋œ๋‹ค๊ณ 
๋ณผ ์ˆ˜ ์žˆ์œผ๋‚˜, "๋ฌธ์žฅ์˜ ์‹œ์ž‘๋ฒˆํ˜ธ, e-mail, ์˜์–ด์˜ ์•ฝ์–ด,
๊ฐ•์กฐ๋ฅผ ์œ„ํ•œ ๋ฐ˜๋ณต, ์—ด๊ฑฐ ๋“ฑ์— ์‚ฌ์šฉ๋˜๋Š” ๋งˆ์นจํ‘œ ๋˜๋Š”
์ž˜๋ชป ์‚ฌ์šฉ๋œ ๊ตฌ๋‘์ ์˜ ๊ฒฝ์šฐ์™€ ๊ฐ™์ด ๊ทธ๋ ‡์ง€ ์•Š์€ ๊ฒฝ์šฐ
๋„ ๋งŽ๋‹ค.
๋”ฐ๋ผ์„œ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์˜ ๊ฐ€์žฅ ์ข‹์€ ๋ฐฉ๋ฒ•์€ ๊ฐ ๋ถ„์•ผ
์˜ ์ „๋ฌธ๊ฐ€๊ฐ€ ํ•ด๋‹น ์œ„์น˜๋ฅผ ์ธ์‹ํ•˜๋Š” ๊ทœ์น™์„ ์ •ํ•˜๋Š” ๊ฒƒ
์ด ๊ฐ€์žฅ ์ด์ƒ์ ์ผ ์ˆ˜ ์žˆ์œผ๋‚˜, ๋„ˆ๋ฌด ๋งŽ์€ ๋น„์šฉ์ด ์†Œ๋ชจ
๋  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋˜์–ด ์ถ”์ฒœํ•  ๋งŒํ•œ ๋ฐฉ๋ฒ•์ด ์•„๋‹ˆ๋ฉฐ, ๋ณธ
๋…ผ๋ฌธ์—์„œ๋Š” ๋Œ€๋Ÿ‰์˜ ์ฝ”ํผ์Šค์—์„œ ์ถ”์ถœํ•œ ๋ฌธ์žฅ๊ฒฝ๊ณ„์˜ ํ›„
๋ณด๋กœ ์‚ผ์„ ์ˆ˜ ์žˆ๋Š” ๊ตฌ๋‘์ ์„ ์ถ”์ถœํ•˜๊ณ  ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•
์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์žฅ๊ฒฝ๊ณ„๋กœ ์ธ์‹๋˜๋Š” ์œ„์น˜๋ฅผ ํ‘œํ˜„ํ•  ์ˆ˜
์žˆ๋Š” ๋‹ค์–‘ํ•œ ์ž์งˆ์„ ํ•™์Šตํ•˜๊ณ , ํ–ฅํ›„ ์ƒˆ๋กœ์šด ๋ฌธ์„œ์—
๋Œ€ํ•˜์—ฌ ๋ฌธ์žฅ๊ฒฝ๊ณ„๋ฅผ ํŒ๋‹จํ•ด๋‚ผ ์ˆ˜ ์žˆ๋Š” ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹
๊ธฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค.
๋‹ค์–‘ํ•œ ์–ธ์–ด์— ๋Œ€ํ•œ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์€ ์ตœ๋Œ€ํ•œ ๋งŽ์€
์–ธ์–ด์— ๋Œ€ํ•œ ์‹คํ—˜์ด ํ•„์š”ํ•˜๊ฒ ์œผ๋‚˜, ์ฝ”ํผ์Šค ๋˜๋Š” ํ•ด๋‹น
์–ธ์–ด์— ๋Œ€ํ•œ ์ดํ•ด์˜ ๋ถ€์กฑ์œผ๋กœ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•œ๊ตญ์–ด
๋ฐ ์˜์–ด์— ๋Œ€ํ•ด์„œ๋งŒ ์‹คํ—˜ํ•˜๊ธฐ๋กœ ํ•˜์˜€๋‹ค.
๊ธฐ๊ณ„ํ•™์Šต์— ์‚ฌ์šฉ๋œ ๋„๊ตฌ๋กœ๋Š” WEKA Workbench ๋ผ๋Š”
๋ฐ์ดํ„ฐ๋งˆ์ด๋‹ ํˆดํ‚ท์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์ œ๊ณต๋˜๋Š” ๋‹ค์–‘ํ•œ
๋ถ„๋ฅ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์˜ํ•ด์„œ
์ •ํ™•๋ฅ ๊ณผ ์žฌํ˜„์œจ์ด ํฌ๊ฒŒ ์ฐจ์ด ๋‚˜๋Š” ๊ฒฝ์šฐ๋Š” ์—†์—ˆ์œผ๋‚˜,
์„ฑ๋Šฅ ๋ฉด์—์„œ Random Forest ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๊ฐ€์žฅ ์ข‹์€ ์„ฑ
๋Šฅ์„ ๋ณด์˜€๊ณ , ํ•™์Šต์— ์‚ฌ์šฉ๋œ ์ฝ”ํผ์Šค๋Š” ์›์‹œ ์ฝ”ํผ์Šค๋งŒ
์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์–ด์ ˆ๊ธฐ์ค€์œผ๋กœ ๊ตฌ๋ถ„ํ•˜๊ณ  ๋‹ค์‹œ ์–ด์ ˆ
์—์„œ ๊ตฌ๋‘์  ํ›„๋ณด๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด, ๊ตฌ๋‘์ ์„ ํฌํ•จํ•˜์—ฌ ๋ณ„
๋„์˜ ํ† ํฐ์œผ๋กœ ๋ถ„๋ฆฌํ•ด ๋‚ด์–ด ๋ถ„์„์„ ์‹œ๋„ํ•˜์˜€๋‹ค.
2. ๊ด€๋ จ์—ฐ๊ตฌ
Riley, Michael D. (1989)๋Š” ๊ตฌ๋‘์ ์ด ๋ฐœ์ƒํ•˜๋Š” ์ฃผ๋ณ€
๋‹จ์–ด์˜ ์ถœํ˜„ํ™•๋ฅ  ๋ฐ ๊ตฌ๋‘์ ์ด ๋ฐœ๊ฒฌ๋œ ์–ด์ ˆ์˜ ํด๋ž˜์Šค
๋“ฑ์˜ ์ž์งˆ์„ ์ถ”์ถœํ•˜์˜€๊ณ , ํ™•๋ฅ ์ •๋ณด๋Š” AP News 2 ์ฒœ 5
๋ฐฑ๋งŒ ๋‹จ์–ด๋ฅผ ํ†ตํ•ด ๊ตฌํ•˜์˜€์œผ๋ฉฐ, Brown ์ฝ”ํผ์Šค ์—์„œ
Decision Tree (C4.5)๋ฅผ ์ด์šฉํ•œ ๊ฒฐ๊ณผ 99.8%์˜ ์ •ํ™•๋ฅ ์„
๋ณด์˜€๋‹ค.
์ œ 29 ํšŒ ํ•œ๊ตญ์ •๋ณด์ฒ˜๋ฆฌํ•™ํšŒ ์ถ˜๊ณ„ํ•™์ˆ ๋ฐœํ‘œ๋Œ€ํšŒ ๋…ผ๋ฌธ์ง‘ ์ œ 15 ๊ถŒ ์ œ 1 ํ˜ธ (2008. 5)
2
David D. Palmer, Marti A. Hearst, (1994)๋Š” ๊ตฌ๋‘์  ์ฃผ
๋ณ€์˜ ๋‹จ์–ด์— ๋Œ€ํ•œ ํ’ˆ์‚ฌ์˜ ํ™•๋ฅ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ 20 ๊ฐ€
์ง€ ์ •๋„์˜ ํ† ํฐ์œผ๋กœ ๊ตฌ๋ถ„ํ•˜์˜€๊ณ , Feed-forward Neural
Network ๋ฅผ ์ด์šฉํ•œ ๊ฒฐ๊ณผ 98.5%์˜ ์ •ํ™•๋ฅ ์„ ๋ณด์˜€๋‹ค.
Jeffrey C. Reynar and Adwait Ratnaparkhi, (1997)๋Š” ๊ตฌ
๋‘์  ํ›„๋ณด๊ฐ€ ๋ฐœ์ƒํ•œ ์•ž/๋’ค ํ† ํฐ์˜ ํ™•๋ฅ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜
์˜€์œผ๋ฉฐ, Maximum Entropy ๊ธฐ๋ฒ•์„ ์ด์šฉํ•˜์—ฌ Wall Street
Journal ๋ฐ Brown ์ฝ”ํผ์Šค์—์„œ ๊ฐ๊ฐ 98.0%, 97.5%์˜ ์ •
ํ™•๋ฅ ์„ ๋ณด์˜€๋‹ค.
์ž„ํฌ์„, ํ•œ๊ตฐํฌ (2004)๋Š” ํ›„๋ณด ๊ตฌ๋‘์  ์ž์ฒด์˜ ํ™•๋ฅ ,
์•ž/๋’ค ๋ฐœ์ƒํ•˜๋Š” ์Œ์ ˆ ๊ทธ๋ฆฌ๊ณ  ์ธ์šฉ๋ถ€ํ˜ธ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ž์งˆ
๋กœ ์ด์šฉํ•˜์˜€์œผ๋ฉฐ, kNN ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ETRI, KAIST ์ฝ”
ํผ์Šค์—์„œ ๊ฐ๊ฐ 96.73%, 98.64%์˜ ์ •ํ™•๋ฅ ์„ ๋ณด์˜€๋‹ค. ๋˜
ํ•œ ๋‘ ์ฝ”ํผ์Šค๋ฅผ ๋ชจ๋‘ ํ•™์Šตํ•œ ๊ฒฝ์šฐ์—๋Š” 98.82%์˜ ์ •ํ™•
๋ฅ ์„ ๋ณด์˜€๋‹ค.
์ด์™€ ๊ฐ™์ด ์˜์–ด์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๋Š” ๋งŽ์ด ์ด๋ฃจ์–ด์กŒ์œผ๋‚˜
ํ•œ๊ตญ์–ด์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๋Š” ์•„์ง ๋งŽ์ง€ ์•Š์€ ์‹ค์ •์ด๋ฉฐ, ๋˜
ํ•œ ๊ธฐ์กด์— ์—ฐ๊ตฌ๋œ ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„์ธ์‹์˜ ๋…ผ๋ฌธ์—์„œ๋„
์–ธ๊ธ‰ํ•˜์˜€๋“ฏ์ด ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„์— ๋Œ€ํ•œ ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ
์ฆ˜์„ ํ†ตํ•œ ์‹คํ—˜์ด ํ•„์š”ํ•˜๋‹ค. ๋˜ํ•œ ๊ฐœ๋ณ„ ์–ธ์–ด์— ๋Œ€ํ•ด
์„œ๋งŒ ์‹คํ—˜ ๋ฐ ๊ฒ€์ฆ์ด ์ด๋ฃจ์–ด์ง€๊ณ , ํ•™์Šต์— ์‚ฌ์šฉ๋œ ์ž
์งˆ์ด ๋‹ค๋ฅธ ๋„๋ฉ”์ธ ๋˜๋Š” ๋‹ค๋ฅธ ๊ณ„ํ†ต ์–ธ์–ด์— ๋Œ€ํ•œ ์‹คํ—˜
๋ฐ ๊ฒ€์ฆ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š์•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๊ฒ ๋‹ค.
ํ•˜์ง€๋งŒ ์ตœ๊ทผ ์›น ์ž๋ฃŒ ๋ฐ ๋ฌธ์„œ์˜ ๊ฒฝ์šฐ ๋‹ค์–‘ํ•œ ์–ธ์–ด
์˜ ์ž๋ฃŒ๊ฐ€ ๋งŽ์ด ๋ฐœ์ƒํ•˜๋ฉฐ, ๊ทธ๋Ÿฌํ•œ ๋ฒ”์šฉ์ ์ธ ์ž์งˆ์—
๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๊ณ  ์ƒ๊ฐ๋˜๋ฉฐ, ์ด์— ๋ฌธ์žฅ๊ฒฝ๊ณ„์—
ํ•„์š”ํ•œ ๋‹ค์–‘ํ•œ ์ž์งˆ์„ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ ํ•™์Šต
ํ•˜๊ณ  ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„ ๋ฟ๋งŒ์ด ์•„๋‹ˆ๋ผ ์˜์–ด์— ๋Œ€ํ•˜์—ฌ
๋„ ์‹คํ—˜์„ ํ•˜๊ณ ์ž ํ•œ๋‹ค.
3. ๋ฌธ์žฅ ๊ฒฝ๊ณ„์ธ์‹
๊ฐ€. ์ฝ”ํผ์Šค ์ •์ œ ๋ฐ ๊ตฌ์กฐํ™”
1) ํ•œ๊ตญ์–ด์˜ ๊ฒฝ์šฐ ๋ฌธ์ž์—ด ์ธ์ฝ”๋”ฉ์„ EUC-KR
์—์„œ UTF-8 ์œผ๋กœ ๋ณ€ํ™˜
2) ๊นจ์ง„ ๋ฌธ์ž์—ด ๋ฐ ์ค‘๋ณต๋œ ์˜ค๋ถ„์„ ๊ฒฐ๊ณผ ์ œ๊ฑฐ
3) ๋ฌธ์žฅ, ์–ด์ ˆ, ํ† ํฐ๋‹จ์œ„๋กœ ์ €์žฅ
4) ์ „์ฒด ํ•™์Šต ์ฝ”ํผ์Šค๋ฅผ ๊ตฌ์กฐํ™” ํ•˜์—ฌ ์ €์žฅ
๋‚˜. ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•œ ํ†ต๊ณ„์ •๋ณด ์ถ”์ถœ
1) ์ฝ”ํผ์Šค ๋‚ด์˜ ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„๋กœ ์ธ์‹๋œ ์œ„์น˜
์˜ ๊ตฌ๋‘์ (ํŠน์ˆ˜๋ฌธ์ž ํฌํ•จ)์˜ ๋นˆ๋„์ˆ˜
2) ๊ตฌ๋‘์  ํ›„๋ณด ์•ž/๋’ค์˜ ํ† ํฐ์œ ํ˜•(๋ณธ ๋…ผ๋ฌธ์—์„œ
๋ถ„๋ฅ˜ํ•œ ํ•œ๊ตญ์–ด์—์„œ ๋ฐœ์ƒํ•˜๋Š” 6 ๊ฐ€์ง€ ํ† ํฐ
์œ ํ˜•)์˜ ๋นˆ๋„์ˆ˜
๋‹ค. ๋ฌธ์žฅ๊ฒฝ๊ณ„๋ฅผ ํŒ๋ณ„ํ•˜๊ธฐ ์œ„ํ•œ ํ†ต๊ณ„์ •๋ณด ์ถ”์ถœ
1) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์œ„์น˜์—์„œ ๊ตฌ๋‘์ ์˜ ์œ 
ํ˜•์— ๋”ฐ๋ฅธ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜
2) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์•ž/๋’ค ํ† ํฐ์— ๋Œ€ํ•˜์—ฌ
ํ† ํฐ์œ ํ˜•์— ๋”ฐ๋ฅธ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜
3) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์•ž/๋’ค 1 ์Œ์ ˆ์— ๋Œ€ํ•œ
๋ฌธ์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜
4) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์•ž/๋’ค ์–ด์ ˆ์— ๋Œ€ํ•œ ๋ฌธ
์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜
5) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์•ž/๋’ค ํ† ํฐ์— ๋Œ€ํ•œ ๋ฌธ
์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜
4. ํ†ต๊ณ„์  ์ž์งˆ
๊ฐ€. ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์œ„์น˜์˜ ์„ ํƒ
๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„์—์„œ ๋ฐœ์ƒํ•œ ์Œ์ ˆ ๋˜๋Š” ๋ฌธ์ž์—
๋Œ€ํ•œ ํ†ต๊ณ„์ •๋ณด๋ฅผ ํ†ตํ•˜์—ฌ, ๊ฐ€์žฅ ๋งŽ์ด ๋ฐœ์ƒํ•œ ์ƒ
์œ„ 99.9%์˜ ์Œ์ ˆ ๋˜๋Š” ๋ฌธ์ž๊ฐ€ ์ถœํ˜„ํ•˜๋Š” ์œ„์น˜๋ฅผ
๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋กœ ๋ณด์•˜์œผ๋ฉฐ, ํ•œ๊ตญ์–ด์˜ ๊ฒฝ์šฐ ์ด
5 ๊ฐ€์ง€์˜ ๊ตฌ๋‘์ (๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ, ๋Š๋‚Œํ‘œ, ํฐ/์ž‘
์€๋”ฐ์˜ดํ‘œ)์ด ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋กœ ๋ณผ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์˜
์–ด์˜ ๊ฒฝ์šฐ ์ด 3 ๊ฐ€์ง€์˜ ๊ตฌ๋‘์ (๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ,
ํฐ ๋”ฐ์˜ดํ‘œ)์ด ํ›„๋ณด๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค.
ํ•œ๊ตญ์–ด์˜ ๊ฒฝ์šฐ๊ฐ€ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด์ข…๋ฅ˜๊ฐ€ ๋” ๋งŽ
์œผ๋ฏ€๋กœ ํ•œ๊ตญ์–ด๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ํ‘œํ˜„ ํ•˜์˜€๋‹ค.
๋‚˜. ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋ฅผ ์œ„ํ•œ ๊ตฌ๋‘์ 
1) ๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ, ๋Š๋‚Œํ‘œ, ํฐ/์ž‘์€๋”ฐ์˜ดํ‘œ
2) ํ›„๋ณด๊ฐ€ ๋ฐœ์ƒํ•œ ๊ฒฝ์šฐ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋กœ ์ธ์‹.
๋‚˜๋Š” ํ•™๊ต์— ๊ฐ‘๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ง‘์œผ๋กœ ์˜ต๋‹ˆ๋‹ค.
โ†‘ โ†‘
๋‹ค. ๊ตฌ๋‘์ ํ›„๋ณด ์•ž/๋’ค์˜ ํ† ํฐ์œ ํ˜•
1) ํ•œ๊ธ€/ ์™ธ๊ตญ์–ด/ ์ˆซ์ž/ ํŠน์ˆ˜๋ฌธ์ž
2) ์œ ํ˜• 1 (๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ, ๋Š๋‚Œํ‘œ)
3) ์œ ํ˜• 2 (์‰ผํ‘œ, ์ฝœ๋ก , ๋น—๊ธˆ)
4) ์œ ํ˜• 3 (๋”ฐ์˜ดํ‘œ, ๊ด„ํ˜ธ, ์ค„์ž„ํ‘œ, ๋ถ™์ž„ํ‘œ)
5) ์œ ํ˜• 4 (๊ธฐํƒ€ ํŠน์ˆ˜๋ฌธ์ž)
6) ์œ ํ˜• 5 (์œ ํ˜•์— ํฌํ•จ๋˜์ง€ ์•Š๋Š” ๋ฌธ์ž์—ด)
๋‹น์ผ ์ฃผ๊ฐ€๋Š” 13.3% ๋–จ์–ด์ง„ 9.96 ๋‹ฌ๋Ÿฌ๊ฐ€ ๋๋‹ค.
์ˆซ์ž/./๊ธฐํ˜ธ 4 ์ˆซ์ž/./์ˆซ์ž ํ•œ๊ธ€/.
๋ผ. ๊ตฌ๋‘์ ํ›„๋ณด์˜ ์•ž/๋’ค์˜ ์•ž/๋’ค์˜ 1 ์Œ์ ˆ
1) ํ†ต๊ณ„์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฐœ์ƒํ•  ํ™•๋ฅ ์„ ์ด์šฉ
์ƒํ’ˆ์„ ๋‚ด๋†“๋Š”๋‹ค. ๋‚ด๋‹ฌ๋ถ€ํ„ฐ๋Š” ๊ฐ€์กฑ๊ฐ„
๋‹ค/./๋‚ด
๋งˆ. ๊ตฌ๋‘์ ํ›„๋ณด ์•ž/๋’ค์˜ ํ† ํฐ ๋ฌธ์ž์—ด
1) ํ†ต๊ณ„์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฐœ์ƒํ•  ํ™•๋ฅ ์„ ์ด์šฉ
์žก์ง€ ์•Š์•˜์œผ๋ฉด ํ•œ๋‹ค"๋ฉฐ "CNN ๊ณผ ํ•จ๊ป˜ ๋Œ€์„ 
ํ•œ๋‹ค/"/๋ฉฐ ๋ฉฐ/"/CNN
๋ฐ”. ๊ตฌ๋‘์ ํ›„๋ณด ์•ž/๋’ค์˜ ํ† ํฐ์˜ ๊ธธ์ด
1) ๊ธธ์ด๋Š” ์Œ์ ˆ์˜ ๊ฐœ์ˆ˜
57.7%๋กœ ์ง€๋‚œ 2003 ๋…„ 12 ์›”(104.4%) ์ดํ›„
2/./7 3/./1
์‚ฌ. ๊ตฌ๋‘์ ํ›„๋ณด ์ด์ „/๋‹ค์Œ ๊ตฌ๋‘์  ํ›„๋ณด๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ
1) ๊ฑฐ๋ฆฌ๋Š” ์‚ฌ์ด์— ์กด์žฌํ•˜๋Š” ํ† ํฐ์˜ ์ˆ˜
0.6 -2.5 ์™€ํŠธ์˜ TDP ๊ทœ๊ฒฉ์„ ๋”ฐ๋ฅด๋ฉฐ 1.8GHz ๊นŒ์ง€
1/./3 3/./6 6/./3
5. ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ• ์„ ํƒ
๊ฐ€. ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์œผ๋กœ ์‹คํ—˜
1) ์ถฉ๋ถ„ํžˆ ์ž‘์€ ํ•™์Šต์ง‘ํ•ฉ ์ƒ˜ํ”Œ๋ง
ํ•ด๋‹น ์ฝ”ํผ์Šค์— ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚œ ๊ธฐ๊ณ„ํ•™์Šต
๊ธฐ๋ฒ• ์„ ํƒ์„ ์œ„ํ•จ์ด๋ฏ€๋กœ, ๊ฒฐ๊ณผ์— ์™œ๊ณก์ด ๋ฐœ์ƒ
ํ•˜์ง€ ์•Š์„ ์ •๋„์˜ ์ž‘์€ ์ง‘ํ•ฉ์„ ์ƒ˜ํ”Œ๋ง ํ•˜์˜€๋‹ค.
2) ์ถ”์ถœ๋œ ํ•™์Šต์ง‘ํ•ฉ์„ ํ†ตํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ
์ œ 29 ํšŒ ํ•œ๊ตญ์ •๋ณด์ฒ˜๋ฆฌํ•™ํšŒ ์ถ˜๊ณ„ํ•™์ˆ ๋ฐœํ‘œ๋Œ€ํšŒ ๋…ผ๋ฌธ์ง‘ ์ œ 15 ๊ถŒ ์ œ 1 ํ˜ธ (2008. 5)
3
์ฆ˜์— ๋Œ€ํ•˜์—ฌ ์‹คํ—˜์„ ์ˆ˜ํ–‰
3) ์‹คํ—˜ ๋Œ€์ƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘์—์„œ ์ƒ์œ„ 2 ๊ฐœ์˜
์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ ํƒ
4) ์ถฉ๋ถ„ํžˆ ํฐ ํ•™์Šต์ง‘ํ•ฉ์— ๋Œ€ํ•˜์—ฌ ๋‹ค์‹œ ์‹คํ—˜
์„ ํƒํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ „์ฒด ํ•™์Šต์ง‘ํ•ฉ์—์„œ๋„ ์ž˜
๋™์ž‘ํ•˜๋Š” ์ง€๋ฅผ ์œ„ํ•œ ์‹คํ—˜์„ ๋‹ค์‹œ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค.
๋‚˜. 1 ์ฐจ ์‹คํ—˜๊ฒฐ๊ณผ
์šฐ์„  1 ์ฐจ ์‹คํ—˜์œผ๋กœ์จ ์ถฉ๋ถ„ํžˆ ์ž‘์€ ํ•™์Šต์ง‘ํ•ฉ์œผ
๋กœ ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ์‹คํ—˜ํ•ด ๋ณด๊ณ , ๊ฐ€์žฅ
์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚œ 2 ๊ฐœ์˜ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์œผ๋กœ 2 ์ฐจ
์‹คํ—˜์„ ํ•˜๊ณ ์ž ํ•œ๋‹ค.
1) ์ƒ˜ํ”Œ๋ง ํ•™์Šต์ง‘ํ•ฉ ํ•™์Šต ํ›„ ์‹คํ—˜๊ฒฐ๊ณผ
Classifier Precision Recall f-measure
BayesNet 0.962 0.917 0.939
NaiveBayes 0.954 0.878 0.915
Logistic 0.965 0.977 0.971
Multilayer
Perceptron
0.957 0.97 0.964
kNN 0.954 0.963 0.958
AdaBoost 0.934 0.953 0.944
Bagging 0.966 0.984 0.975
Dagging 0.952 0.963 0.957
Decorate 0.967 0.966 0.966
LogitBoost 0.947 0.95 0.949
ADTree 0.942 0.964 0.953
J48(C4.5) 0.97 (1st
) 0.969 0.964
Random
Forest
0.968 (2nd
) 0.981 0.974
REPTree 0.96 0.983 0.971
2) ๊ฒฐ๊ณผ๋ถ„์„
โ‘  ์ด 885 ๊ฐœ์˜ ์–ด์ ˆ (0.1%)์„ ํ•™์Šต
โ‘ก ์ •ํ™•๋ฅ ๊ธฐ์ค€ ์ƒ์œ„ 2 ๊ฐœ์˜ Classifier ์„ ํƒํ•˜
์˜€์œผ๋ฉฐ, ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์˜ ๊ฒฝ์šฐ ์žฌํ˜„์œจ
๋ณด๋‹ค๋Š” ์ •ํ™•๋ฅ ์ด ์ข€ ๋” ์˜๋ฏธ ์žˆ๋Š” ์ง€ํ‘œ
๋ผ๊ณ  ํŒ๋‹จํ•˜์˜€๋‹ค.
๋‹ค. 2 ์ฐจ ์‹คํ—˜๊ฒฐ๊ณผ
1) 30% ์ƒ˜ํ”Œ๋ง ํ•™์Šต์ง‘ํ•ฉ ํ•™์Šต ํ›„ ์‹คํ—˜๊ฒฐ๊ณผ
Classifier Precision Recall f-measure
Decision
Tree C4.5
0.979 0.98 0.98
Random
Forest
0.983 0.986 0.985
2) 70% ์ƒ˜ํ”Œ๋ง ํ•™์Šต์ง‘ํ•ฉ ํ•™์Šต ํ›„ ์‹คํ—˜๊ฒฐ๊ณผ
Classifier Precision Recall f-measure
Decision
Tree C4.5
0.98 0.989 0.984
Random
Forest
0.991 0.992 0.991
3) ๊ฒฐ๊ณผ๋ถ„์„
โ‘  265,529 ์–ด์ ˆ(30%), 619,567 ์–ด์ ˆ(70%)
โ‘ก 1,000 ๊ฐœ๊ฐ€ ๋˜์ง€ ์•Š๋Š” ํ•™์Šต์ง‘ํ•ฉ๊ณผ ๊ทธ์˜
1,000 ๋ฐฐ๊ฐ€ ๋„˜๋Š” ํ•™์Šต์ง‘ํ•ฉ๊ณผ ์ •ํ™•๋ฅ  ,
์žฌํ˜„์œจ ๋ชจ๋‘ 3%์ •๋„ ๋ฐ–์— ํ–ฅ์ƒ์ด ์—†
์—ˆ๋‹ค.
๋ผ. ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ• ์„ ํƒ
1) ์‹œ์Šคํ…œ ์„ฑ๋Šฅ์ƒ Random Forest ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ
๋ชจ๋“  ํ•™์Šต์ง‘ํ•ฉ์„ ํ•™์Šต ์‹œํ‚ฌ ์ˆ˜๋Š” ์—†์—ˆ์œผ๋ฉฐ,
70%์ •๋„ (์•ฝ 60 ๋งŒ ์–ด์ ˆ)์˜ ํ•™์Šต์„ ํ†ตํ•˜์—ฌ
๊ฐ€์žฅ ๋‚ณ์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ํŒ๋‹จ๋จ
6. ์‹คํ—˜๊ฒฐ๊ณผ
๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•œ๊ธ€์˜ ๊ฒฝ์šฐ ์„ธ์ข…์ฝ”ํผ์Šค 100 ๋งŒ
์–ด์ ˆ ์ค‘ 70 ๋งŒ ์–ด์ ˆ์„ ์‚ฌ์šฉํ•˜์—ฌ 99.1%์˜ ์ •ํ™•๋ฅ 
์„ ์–ป์–ด๋‚ผ ์ˆ˜ ์žˆ์—ˆ์œผ๋ฉฐ, Spoken ๊ณผ Written ์„ ๋น„์œจ
์„ ๋™์ผํ•˜๊ฒŒ ํ•˜์—ฌ ์‹คํ—˜์„ ์‹ค์‹œํ•˜์˜€์œผ๋ฉฐ, ํ•™์Šต ๋ฐ
์‹คํ—˜์€ 10-fold cross validation ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.
์˜๋ฌธ ์ฝ”ํผ์Šค์ธ Wall Street Journal ์˜ ๊ฒฝ์šฐ๋Š” ์ด
110 ๋งŒ ์–ด์ ˆ ๋ชจ๋‘๋ฅผ ์‚ฌ์šฉํ•˜์˜€๊ณ  ๋™์ผํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ
์‹คํ—˜์„ ์‹ค์‹œํ•˜์˜€๋‹ค.
ํ‰๊ฐ€์ฒ™๋„๋กœ๋Š” ๋ฌธ์žฅ ์ •ํ™•๋ฅ (Precision), ๋ฌธ์žฅ ์žฌํ˜„
์œจ(Recall) ๋ฐ F-measure ๋ฅผ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์ฒ™๋„์˜
์ •์˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
์ˆ˜๋ฌธ์žฅ์ถ”์ถœํ•œ์‹œ์Šคํ…œ์ด
์ˆ˜๋ฌธ์žฅ์ •๋‹ต์ถ”์ถœํ•œ์‹œ์Šคํ…œ์ด
Pr =ecision
์ˆ˜๋ฌธ์žฅ์ „์ฒด
์ˆ˜๋ฌธ์žฅ์ •๋‹ต์ถ”์ถœํ•œ์‹œ์Šคํ…œ์ด
Re =call
callecision
callecision
measureF
RePr
Re*Pr*2
+
=โˆ’
๊ฐ€. ์„ธ์ข…๊ณ„ํš ์ฝ”ํผ์Šค์— ๋Œ€ํ•œ ์‹คํ—˜
Spoken ๊ณผ Written ๋ชจ๋‘ ํ•™์Šต์— ํฌํ•จ์‹œ์ผฐ์œผ๋ฉฐ,
์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚œ Decision Tree ์™€
Random Forest ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ ํƒํ•˜์—ฌ ์‹คํ—˜ํ•˜์˜€๋‹ค
1) ์‹คํ—˜๊ฒฐ๊ณผ
Classifier Precision Recall f-measure
Decision Tree 0.98 0.989 0.984
Random Forest 0.991 0.992 0.991
โ‘  Spoken ์ฝ”ํผ์Šค์™€ Written ์ฝ”ํผ์Šค์— ๋Œ€ํ•˜
์—ฌ 90%์ •๋„๋ฅผ Random Sampling ํ•˜์—ฌ ํ•™
์Šตํ•˜๊ณ  ๋‚˜๋จธ์ง€ 10%๋ฅผ ์ด์šฉํ•˜์—ฌ ์‹คํ—˜ํ•œ
๊ฒฐ๊ณผ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์•˜๋‹ค.
โ‘ก Classifier ์˜ ๊ฒฝ์šฐ Random Forest ๊ฐ€ ๋” ๋‚˜
์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.
๋‚˜. Wall Street Journal ์ฝ”ํผ์Šค์— ๋Œ€ํ•œ ์‹คํ—˜
๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฒ€์ฆ์€ 10-Fold Cross Validation ์œผ๋กœ
ํ•˜์˜€์œผ๋ฉฐ, ํ•œ๊ตญ์–ด์™€๋Š” ๋‹ฌ๋ฆฌ ์ž์งˆ์˜ ์ˆ˜ ๋ฐ ์ •๋ณด์˜
ํฌ๊ธฐ๊ฐ€ ๋‹ฌ๋ผ์„œ ๋ชจ๋“  ํ•™์Šต์ง‘ํ•ฉ์„ ๋‹ค ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ
์—ˆ๋‹ค
1) ์‹คํ—˜๊ฒฐ๊ณผ
Classifier Precision Recall f-measure
Decision Tree 0.989 0.946 0.967
Random Forest 0.981 0.942 0.961
2) ๊ฒฐ๊ณผ๋ถ„์„
โ‘  ์•„์ฃผ ํฐ ์ฐจ์ด๋Š” ์•„๋‹ˆ์ง€๋งŒ, ํ•œ๊ตญ์–ด์™€๋Š”
๋‹ฌ๋ฆฌ Decision Tree ๊ฐ€ ์กฐ๊ธˆ ๋” ์ข‹์€ ์„ฑ๋Šฅ
์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.
โ‘ก ๋™์ผํ•œ ์ž์งˆ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜์˜€์œผ
๋ฉฐ, ๋ณ„๋„์˜ Heuristic ์„ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ ๋„
๋†’์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒ„์œผ๋กœ์จ ํ•ด๋‹น ์ž์งˆ์ด
๋ฒ”์šฉ์ ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.
๋‹ค. ์ž์งˆ์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์‹คํ—˜
๊ฐ ์ž์งˆ์„ ์ œ๊ฑฐํ•˜๊ณ  ์‹คํ—˜ํ•˜์˜€์„ ๊ฒฝ์šฐ์— ํ•ด๋‹น
์ž์งˆ์ด ์–ผ๋งˆ๋‚˜ ์ „์ฒด ์ •ํ™•๋ฅ ์„ ๋†’์ด๋Š” ๋ฐ์— ๊ธฐ
์ œ 29 ํšŒ ํ•œ๊ตญ์ •๋ณด์ฒ˜๋ฆฌํ•™ํšŒ ์ถ˜๊ณ„ํ•™์ˆ ๋ฐœํ‘œ๋Œ€ํšŒ ๋…ผ๋ฌธ์ง‘ ์ œ 15 ๊ถŒ ์ œ 1 ํ˜ธ (2008. 5)
4
์—ฌํ•˜๋Š” ์ง€๋ฅผ ํ…Œ์ŠคํŠธ ํ•ด๋ณด์•˜๋‹ค. ์ „์ฒด ์‹คํ—˜์ง‘ํ•ฉ
์—์„œ ํ…Œ์ŠคํŠธ๋Š” ํ•˜์ง€ ๋ชปํ•˜์˜€์œผ๋ฉฐ, ์ƒ˜ํ”Œ๋ง ํ•œ ๊ฒฐ
๊ณผ์—์„œ ํ•ด๋‹น ์ž์งˆ์— ๋Œ€ํ•œ ์ƒ๋Œ€์ ์ธ ๋น„๊ต๋ฅผ ์ˆ˜
ํ–‰ํ•˜์˜€๋‹ค.
Removed
Feature
Precision Recall f-measure
Complete Set
(All Features)
0.975 0.982 0.978
Base Set
(Punctuation)
0.92
(-5.5%)
0.936
(4.6%)
0.928
(-5%)
Prefix-token-type
0.974
(-0.1%)
0.98
(-0.2%)
0.977
(-0.1%)
Suffix-token-type
0.974
(-0.1%)
0.982 0.978
Prefix-syllable 0.975 0.982 0.978
Suffix-syllable
0.976
(+0.1%)
0.979
(-0.3%)
0.978
Prefix-token
0.978
(+0.3%)
0.979
(-0.3%)
0.978
Suffix-token
0.973
(-0.2%)
0.976
(-0.6%)
0.974
(-0.4%)
Prefix-token-
length
0.964
(-1.1%)
0.984
(+0.2%)
0.974
(-0.4%)
Suffix-token-
length
0.973
(-0.2%)
0.983
(+0.1%)
0.978
Prefix-token-
distance
0.975
0.983
(+0.1%)
0.979
(+0.1%)
Suffix-token-
distance
0.975
0.979
(-0.3%)
0.977
(-0.1%)
1) ์„ ํƒ๋œ ์ž์งˆ์„ ์ œ๊ฑฐํ•˜๊ณ  ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ, ์•ž/
๋’ค ํ† ํฐ์˜ ๊ธธ์ด ํ† ํฐ์˜ ์œ ํ˜•, ๊ตฌ๋‘์  ์ˆœ์œผ
๋กœ ๊ธ์ •์ ์ธ ์˜ํ–ฅ์„ ๋ณด์˜€๋‹ค.
2) ๋ฐ˜๋ฉด์— ๋’ค ์ฒซ ์Œ์ ˆ ์•ž์˜ ํ† ํฐ ๋“ฑ์˜ ์ž์งˆ์€
์˜คํžˆ๋ ค ์ •ํ™•๋ฅ ์„ ๋–จ์–ด๋œจ๋ฆฌ๋Š” ํ˜„์ƒ์„ ๋ณด์˜€
๋‹ค
๋ผ. ๊ธฐ์—ฌ๋„๊ฐ€ ๋–จ์–ด์ง€๋Š” ์ž์งˆ์„ ์ œ๊ฑฐํ•˜๊ณ  ์‹คํ—˜
Features Precision Recall f-measure
12 features 98% 98.9% 98.4%
10 features 98.5% 98.1% 98.4%
1) ์œ„์—์„œ ์ œ์‹œ๋œ ์ •ํ™•๋ฅ ์„ ๋–จ์–ด๋œจ๋ฆฌ๋Š” ์ž์งˆ
์„ ์ œ๊ฑฐํ•˜๊ณ  ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ ํ™•์‹คํžˆ ์ •ํ™•๋ฅ ์€
์˜ฌ๋ผ๊ฐ”์œผ๋‚˜, f-measure ๋Š” ๋™์ผํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ฌ
์œผ๋ฉฐ, ์ „์ฒด์ ์œผ๋กœ ํ•ด๋‹น ์ž์งˆ์„ ์ œ๊ฑฐํ•˜์—ฌ
์ •ํ™•๋ฅ ์€ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์—ˆ์œผ๋‚˜, ์žฌํ˜„์œจ์€
๋–จ์–ด์กŒ๋‹ค.
๋งˆ. ํ•™์Šต๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ๋ณ€ํ™˜
1) ํ•™์Šต ๋ฐ์ดํ„ฐ๊ฐ€ 1,000 ๊ฑด ์ •๋„์— ๊ฑฐ์˜ 97%
์ด์ƒ์˜ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉฐ, ๊ทธ ์ด์ƒ์˜ ๋ฐ์ดํ„ฐ
์—์„œ ์„ ํ˜•์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜์˜€๋‹ค
2) ๊ทธ๋ž˜ํ”„ ์ƒ์œผ๋กœ๋Š” ์ž˜ ๋ณด์ด์ง€ ์•Š์ง€๋งŒ, ์ผ๋ถ€
์„ฑ๋Šฅ์ด ํŠ€๋Š” ํ˜„์ƒ๋„ ๋ณด์˜€๋‹ค
7. ๊ฒฐ๋ก 
๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ
๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€์œผ๋ฉฐ, Decision Tree ์™€
Random Forest ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ ์‹คํ—˜ ๋ฐ ๋น„๊ต
๋ฅผ ํ•˜์˜€๋‹ค. ๋˜ํ•œ ์–ธ์–ด์— ์ข…์†์ ์ด์ง€ ์•Š์€ ๋…๋ฆฝ์ ์ธ
์ž์งˆ๋งŒ์„ ์‚ฌ์šฉํ•˜๋ ค๊ณ  ๋…ธ๋ ฅํ•˜์˜€์œผ๋ฉฐ, ๊ธฐ์กด์˜ ํ•œ๊ตญ์–ด
๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์˜ ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ ํŠน์ • ๊ตฌ๋‘์  ์ •
๋ณด์— ์˜์กดํ•˜์ง€ ์•Š๊ณ ์„œ๋„ ๋ณด๋‹ค ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๊ณ  ์žˆ
์œผ๋ฉฐ, ๊ตฌ๋‘์ ์˜ ๋ฐœ์ƒํšŸ์ˆ˜์— ๋”ฐ๋ฅธ ํŽธ์ฐจ ๋“ฑ์˜ ํ˜„์ƒ์ด
์—†๊ณ , Spoken Language ์™€ Written Language ์— ๋”ฐ๋ฅธ ์ •
ํ™•๋ฅ ์ด ๋–จ์–ด์ง€๋Š” ๋ฌธ์ œ๋„ ๋งŽ์ด ๋ณด์ •๋˜์—ˆ๋‹ค.
์˜๋ฌธ ์ฝ”ํผ์Šค์—๋„ ๋™์ผํ•˜๊ฒŒ ์ ์šฉํ•˜์—ฌ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ
๋†’์€ ์ •ํ™•๋ฅ ์„ ๋ณด์ž„์œผ๋กœ์จ ์„ ํƒํ•œ ์ž์งˆ์ด ๋ณด๋‹ค ๋ฒ”์šฉ
์ ์ธ ๊ฒƒ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๊ณ  ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ต
ํ•œ ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์—ˆ๋‹ค.
ํ–ฅํ›„ ์—ฐ๊ตฌ๋กœ๋Š” ์ •ํ˜•ํ™” ๋˜์–ด์žˆ์ง€ ์•Š์€ ๋ฌธ์„œ ์ฆ‰, ์›น
๋ฌธ์„œ๋‚˜ ๊ตฌ๋‘์ ์ด ์ƒ๋žต๋œ ๋ฌธ์„œ ๋“ฑ์˜ ๊ฒฝ์šฐ์—๋„ ์–ผ๋งˆ๋‚˜
์ž˜ ๋™์ž‘ํ•˜๋Š” ์ง€์— ๋Œ€ํ•ด์„œ๋„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด
์ž์งˆ์— ๋Œ€ํ•œ ๊ฒƒ๋“ค๋„ ๊ณ ๋ คํ•ด๋ณด๊ณ ์ž ํ•œ๋‹ค.
์ฐธ๊ณ ๋ฌธํ—Œ
[1] 1989, Some applications of tree-based modeling to
speech and language
[2] 1994, Adaptive Sentence Boundary Disambiguation -
Palmer, Hearst
[3] 1997, A Maximum Entropy Approach to Identifying
Sentence Boundaries - Reynar, Ratnaparkhi
[4] 1997, Adaptive Multilingual Sentence Boundary
Disambiguation - Palmer, Hearst
[5] 2000, Reducing parsing complexity by intra-sentence
segmentation based on maximum entropy model
[6] 2000, Tagging sentence boundaries
[7] 2004, Dependency structure analysis and sentence
boundary detection in spontaneous Japanese
[8] 2004, Korean Sentence Boundary Detection Using
Memory-based Machine Learning - Lim, Han
[9] 2005, Instance-based sentence boundary determination by
optimization for natural language generation
[10] 2006, Unsupervised Multilingual Sentence Boundary
Detection

More Related Content

Featured

2024 State of Marketing Report โ€“ by Hubspot
2024 State of Marketing Report โ€“ by Hubspot2024 State of Marketing Report โ€“ by Hubspot
2024 State of Marketing Report โ€“ by HubspotMarius Sescu
ย 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
ย 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
ย 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
ย 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
ย 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
ย 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
ย 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
ย 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
ย 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
ย 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
ย 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
ย 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
ย 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
ย 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceChristy Abraham Joy
ย 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
ย 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
ย 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
ย 

Featured (20)

2024 State of Marketing Report โ€“ by Hubspot
2024 State of Marketing Report โ€“ by Hubspot2024 State of Marketing Report โ€“ by Hubspot
2024 State of Marketing Report โ€“ by Hubspot
ย 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
ย 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
ย 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ย 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
ย 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
ย 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
ย 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
ย 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
ย 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
ย 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
ย 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
ย 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
ย 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
ย 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
ย 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
ย 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
ย 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
ย 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
ย 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
ย 

KIPS_C2008A_0034

  • 1. ์ œ 29 ํšŒ ํ•œ๊ตญ์ •๋ณด์ฒ˜๋ฆฌํ•™ํšŒ ์ถ˜๊ณ„ํ•™์ˆ ๋ฐœํ‘œ๋Œ€ํšŒ ๋…ผ๋ฌธ์ง‘ ์ œ 15 ๊ถŒ ์ œ 1 ํ˜ธ (2008. 5) 1 ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ์ด์šฉํ•œ ๋ฌธ์žฅ๊ฒฝ๊ณ„์ธ์‹ ๋ฐ•์ˆ˜ํ˜*, ์ž„ํ•ด์ฐฝ** *๊ณ ๋ ค๋Œ€ํ•™๊ต ์ปดํ“จํ„ฐ ์ •๋ณดํ†ต์‹ ๋Œ€ํ•™์› **๊ณ ๋ ค๋Œ€ํ•™๊ต ์ปดํ“จํ„ฐํ•™๊ณผ e-mail : *park.suhyuk@gmail.com, **rim@nlp.korea.ac.kr Sentence Boundary Detection Using Machine Learning Techniques Su-Hyuk Park*, Hae-Chang Rim** * Graduate School of Computer and Information Technology, Korea University ** Dept. of Computer Science and Engineering, Korea University ์š” ์•ฝ ๋ณธ ๋…ผ๋ฌธ์€ ์–ธ์–ด์˜ ํ†ต๊ณ„์  ํŠน์ง•์„ ์ด์šฉํ•˜์—ฌ ๋ฒ”์šฉ์˜ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹๊ธฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ• ์€ ๋Œ€๋Ÿ‰์˜ ์ฝ”ํผ์Šค ๋‚ด์—์„œ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋Š” ๋ฌธ์žฅ ๊ฒฝ๊ณ„๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์Œ์ ˆ ๋ฐ ์–ด์ ˆ ๋“ฑ์˜ ์ž์งˆ์„ ์ด์šฉํ•˜์—ฌ ํ†ต๊ณ„์  ํŠน์ง•์„ ์ถ”์ถœํ•˜๊ณ  ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์žฅ๊ฒฝ๊ณ„๋ฅผ ์ธ์‹ํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ๋˜ํ•œ ํŠน ์ • ์–ธ์–ด๋‚˜ ๋„๋ฉ”์ธ์— ์ œํ•œ์ ์ด์ง€ ์•Š๊ณ  ๋ฒ”์šฉ์ ์ธ ์ž์งˆ๋งŒ์„ ์‚ฌ์šฉํ•˜๋ ค๊ณ  ๋…ธ๋ ฅํ•˜์˜€๋‹ค. ์–ธ์–ด์˜ ํŠน์„ฑ์ƒ ๋ฌธ์žฅ์˜ ๊ตฌ๋ถ„์ด ์• ๋งคํ•œ ๊ฒฝ์šฐ ๋˜๋Š” ์ž˜๋ชป ์‚ฌ์šฉ ๋œ ๊ตฌ๋‘์  ๋“ฑ์˜ ๊ฒฝ์šฐ์—๋„ ์ ์šฉ ๊ฐ€๋Šฅ ํ•˜๋„๋ก ๋‹ค์–‘ํ•œ ์ž์งˆ์„ ์‚ฌ์šฉํ•˜์—ฌ ์‹คํ—˜ํ•˜์˜€์œผ๋ฉฐ, ํ•œ๊ตญ์–ด์™€ ์˜๋ฌธ ์ฝ”ํผ์Šค์— ๋Œ€ํ•ด์„œ ๋™์ผํ•œ ์ž์งˆ์„ ์ ์šฉ ํ•˜์—ฌ ์‹คํ—˜ํ•˜์—ฌ ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ์ž์งˆ๋“ค์ด ํ•œ๊ตญ์–ด ๋ฐ ๋‹ค๋ฅธ ์–ธ์–ด๊ถŒ์˜ ์–ธ์–ด์—๋„ ์ ์šฉ๋  ์ˆ˜ ์žˆ๋Š” ๋ฒ” ์šฉ์ ์ธ ์ž์งˆ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์„ ์œ„ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๋ฐ ์‹คํ—˜์„ ์œ„ํ•ด์„œ ์„ธ์ข…๊ณ„ํš ์ฝ”ํผ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์„ฑ ๋Šฅ์ฒ™๋„๋กœ๋Š” ์ •ํ™•๋ฅ ๊ณผ ์žฌํ˜„์œจ์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์‹คํ—˜๊ฒฐ๊ณผ ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ 99%์˜ ์ •ํ™•๋ฅ ๊ณผ 99.2%์˜ ์žฌํ˜„์œจ์„ ๋ณด์˜€๋‹ค. ์˜๋ฌธ์˜ ๊ฒฝ์šฐ๋Š” Wall Street Journal ์ฝ”ํผ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ๋™์ผํ•œ ์ž์งˆ์„ ์ ์šฉํ•˜์—ฌ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ 98.9%์˜ ์ •ํ™•๋ฅ ๊ณผ 94.6%์˜ ์žฌํ˜„์œจ์„ ๋ณด์˜€๋‹ค. 1. ์„œ๋ก  โ€˜๋ฌธ์žฅโ€™์˜ ์‚ฌ์ „์ ์ธ ์˜๋ฏธ๋Š” '์˜์‚ฌ๋ฅผ ์ „๋‹ฌํ•˜๋Š” ์ตœ์†Œ์˜ ๋‹จ์œ„'๋กœ ์ •์˜๋˜์–ด ์žˆ์œผ๋ฉฐ, ์ „ํ†ต ๋ฌธ๋ฒ•์—์„œ๋Š” '๋น„๊ต์  ์™„ ์ „ํ•˜๊ณ  ๋…๋ฆฝ๋œ ์˜์‚ฌ์ „๋‹ฌ ๋‹จ์œ„๋‹ค'๋ผ๊ณ  ์ •์˜ํ•˜๊ณ  ์žˆ์œผ ๋ฉฐ, ์ด๋Ÿฌํ•œ ๋ฌธ์žฅ์˜ ๋‹จ์œ„๊ฐ€ ๋Œ€๋ถ€๋ถ„์˜ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋„ ๊ตฌ, ํ’ˆ์‚ฌ๋ถ€์ฐฉ๊ธฐ ๋“ฑ์˜ ๊ธฐ๋ณธ๋‹จ์œ„๊ฐ€ ๋˜๊ณ  ์žˆ์œผ๋ฉฐ, ์Œ์„ฑ ์ธ์‹๋œ ๋ฌธ์žฅ ๋˜๋Š” OCR ์ฒ˜๋ฆฌ๋œ ๋ฌธ์„œ ๋“ฑ์˜ ๋ฌธ์žฅ๊ฒฝ๊ณ„๊ฐ€ ๋ชจํ˜ธํ•œ ๊ฒฝ์šฐ์— ๋Œ€ํ•ด์„œ๋„ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์œผ๋กœ๋„ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์˜ ์ธ์‹์ž‘์—…์ด ๋ฐ˜๋“œ์‹œ ์š”๊ตฌ๋œ๋‹ค. ํ•˜์ง€๋งŒ ์‹ค๋ก€์—์„œ๋Š” ๋ฌธ์žฅ๊ฒฝ๊ณ„์˜ ๊ตฌ๋ถ„์ด ๋ช…ํ™•ํ•˜์ง€ ๋ชป ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์ด ๋ณด์ด๋ฉฐ, ์‚ฌ๋žŒ์ด ๋ณด์•„์„œ๋„ ์ œ๋Œ€๋กœ ๊ตฌ ๋ถ„๋˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ๋„ ๋งŽ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ, ๋Š๋‚Œํ‘œ ๋“ฑ์˜ ๊ธฐํ˜ธ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ฌธ์žฅ๊ฒฝ๊ณ„๊ฐ€ ๊ตฌ๋ถ„๋œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์œผ๋‚˜, "๋ฌธ์žฅ์˜ ์‹œ์ž‘๋ฒˆํ˜ธ, e-mail, ์˜์–ด์˜ ์•ฝ์–ด, ๊ฐ•์กฐ๋ฅผ ์œ„ํ•œ ๋ฐ˜๋ณต, ์—ด๊ฑฐ ๋“ฑ์— ์‚ฌ์šฉ๋˜๋Š” ๋งˆ์นจํ‘œ ๋˜๋Š” ์ž˜๋ชป ์‚ฌ์šฉ๋œ ๊ตฌ๋‘์ ์˜ ๊ฒฝ์šฐ์™€ ๊ฐ™์ด ๊ทธ๋ ‡์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋„ ๋งŽ๋‹ค. ๋”ฐ๋ผ์„œ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์˜ ๊ฐ€์žฅ ์ข‹์€ ๋ฐฉ๋ฒ•์€ ๊ฐ ๋ถ„์•ผ ์˜ ์ „๋ฌธ๊ฐ€๊ฐ€ ํ•ด๋‹น ์œ„์น˜๋ฅผ ์ธ์‹ํ•˜๋Š” ๊ทœ์น™์„ ์ •ํ•˜๋Š” ๊ฒƒ ์ด ๊ฐ€์žฅ ์ด์ƒ์ ์ผ ์ˆ˜ ์žˆ์œผ๋‚˜, ๋„ˆ๋ฌด ๋งŽ์€ ๋น„์šฉ์ด ์†Œ๋ชจ ๋  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋˜์–ด ์ถ”์ฒœํ•  ๋งŒํ•œ ๋ฐฉ๋ฒ•์ด ์•„๋‹ˆ๋ฉฐ, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋Œ€๋Ÿ‰์˜ ์ฝ”ํผ์Šค์—์„œ ์ถ”์ถœํ•œ ๋ฌธ์žฅ๊ฒฝ๊ณ„์˜ ํ›„ ๋ณด๋กœ ์‚ผ์„ ์ˆ˜ ์žˆ๋Š” ๊ตฌ๋‘์ ์„ ์ถ”์ถœํ•˜๊ณ  ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ• ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์žฅ๊ฒฝ๊ณ„๋กœ ์ธ์‹๋˜๋Š” ์œ„์น˜๋ฅผ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค์–‘ํ•œ ์ž์งˆ์„ ํ•™์Šตํ•˜๊ณ , ํ–ฅํ›„ ์ƒˆ๋กœ์šด ๋ฌธ์„œ์— ๋Œ€ํ•˜์—ฌ ๋ฌธ์žฅ๊ฒฝ๊ณ„๋ฅผ ํŒ๋‹จํ•ด๋‚ผ ์ˆ˜ ์žˆ๋Š” ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹ ๊ธฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๋‹ค์–‘ํ•œ ์–ธ์–ด์— ๋Œ€ํ•œ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์€ ์ตœ๋Œ€ํ•œ ๋งŽ์€ ์–ธ์–ด์— ๋Œ€ํ•œ ์‹คํ—˜์ด ํ•„์š”ํ•˜๊ฒ ์œผ๋‚˜, ์ฝ”ํผ์Šค ๋˜๋Š” ํ•ด๋‹น ์–ธ์–ด์— ๋Œ€ํ•œ ์ดํ•ด์˜ ๋ถ€์กฑ์œผ๋กœ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•œ๊ตญ์–ด ๋ฐ ์˜์–ด์— ๋Œ€ํ•ด์„œ๋งŒ ์‹คํ—˜ํ•˜๊ธฐ๋กœ ํ•˜์˜€๋‹ค. ๊ธฐ๊ณ„ํ•™์Šต์— ์‚ฌ์šฉ๋œ ๋„๊ตฌ๋กœ๋Š” WEKA Workbench ๋ผ๋Š” ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹ ํˆดํ‚ท์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์ œ๊ณต๋˜๋Š” ๋‹ค์–‘ํ•œ ๋ถ„๋ฅ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์˜ํ•ด์„œ ์ •ํ™•๋ฅ ๊ณผ ์žฌํ˜„์œจ์ด ํฌ๊ฒŒ ์ฐจ์ด ๋‚˜๋Š” ๊ฒฝ์šฐ๋Š” ์—†์—ˆ์œผ๋‚˜, ์„ฑ๋Šฅ ๋ฉด์—์„œ Random Forest ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๊ฐ€์žฅ ์ข‹์€ ์„ฑ ๋Šฅ์„ ๋ณด์˜€๊ณ , ํ•™์Šต์— ์‚ฌ์šฉ๋œ ์ฝ”ํผ์Šค๋Š” ์›์‹œ ์ฝ”ํผ์Šค๋งŒ ์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์–ด์ ˆ๊ธฐ์ค€์œผ๋กœ ๊ตฌ๋ถ„ํ•˜๊ณ  ๋‹ค์‹œ ์–ด์ ˆ ์—์„œ ๊ตฌ๋‘์  ํ›„๋ณด๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด, ๊ตฌ๋‘์ ์„ ํฌํ•จํ•˜์—ฌ ๋ณ„ ๋„์˜ ํ† ํฐ์œผ๋กœ ๋ถ„๋ฆฌํ•ด ๋‚ด์–ด ๋ถ„์„์„ ์‹œ๋„ํ•˜์˜€๋‹ค. 2. ๊ด€๋ จ์—ฐ๊ตฌ Riley, Michael D. (1989)๋Š” ๊ตฌ๋‘์ ์ด ๋ฐœ์ƒํ•˜๋Š” ์ฃผ๋ณ€ ๋‹จ์–ด์˜ ์ถœํ˜„ํ™•๋ฅ  ๋ฐ ๊ตฌ๋‘์ ์ด ๋ฐœ๊ฒฌ๋œ ์–ด์ ˆ์˜ ํด๋ž˜์Šค ๋“ฑ์˜ ์ž์งˆ์„ ์ถ”์ถœํ•˜์˜€๊ณ , ํ™•๋ฅ ์ •๋ณด๋Š” AP News 2 ์ฒœ 5 ๋ฐฑ๋งŒ ๋‹จ์–ด๋ฅผ ํ†ตํ•ด ๊ตฌํ•˜์˜€์œผ๋ฉฐ, Brown ์ฝ”ํผ์Šค ์—์„œ Decision Tree (C4.5)๋ฅผ ์ด์šฉํ•œ ๊ฒฐ๊ณผ 99.8%์˜ ์ •ํ™•๋ฅ ์„ ๋ณด์˜€๋‹ค.
  • 2. ์ œ 29 ํšŒ ํ•œ๊ตญ์ •๋ณด์ฒ˜๋ฆฌํ•™ํšŒ ์ถ˜๊ณ„ํ•™์ˆ ๋ฐœํ‘œ๋Œ€ํšŒ ๋…ผ๋ฌธ์ง‘ ์ œ 15 ๊ถŒ ์ œ 1 ํ˜ธ (2008. 5) 2 David D. Palmer, Marti A. Hearst, (1994)๋Š” ๊ตฌ๋‘์  ์ฃผ ๋ณ€์˜ ๋‹จ์–ด์— ๋Œ€ํ•œ ํ’ˆ์‚ฌ์˜ ํ™•๋ฅ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ 20 ๊ฐ€ ์ง€ ์ •๋„์˜ ํ† ํฐ์œผ๋กœ ๊ตฌ๋ถ„ํ•˜์˜€๊ณ , Feed-forward Neural Network ๋ฅผ ์ด์šฉํ•œ ๊ฒฐ๊ณผ 98.5%์˜ ์ •ํ™•๋ฅ ์„ ๋ณด์˜€๋‹ค. Jeffrey C. Reynar and Adwait Ratnaparkhi, (1997)๋Š” ๊ตฌ ๋‘์  ํ›„๋ณด๊ฐ€ ๋ฐœ์ƒํ•œ ์•ž/๋’ค ํ† ํฐ์˜ ํ™•๋ฅ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜ ์˜€์œผ๋ฉฐ, Maximum Entropy ๊ธฐ๋ฒ•์„ ์ด์šฉํ•˜์—ฌ Wall Street Journal ๋ฐ Brown ์ฝ”ํผ์Šค์—์„œ ๊ฐ๊ฐ 98.0%, 97.5%์˜ ์ • ํ™•๋ฅ ์„ ๋ณด์˜€๋‹ค. ์ž„ํฌ์„, ํ•œ๊ตฐํฌ (2004)๋Š” ํ›„๋ณด ๊ตฌ๋‘์  ์ž์ฒด์˜ ํ™•๋ฅ , ์•ž/๋’ค ๋ฐœ์ƒํ•˜๋Š” ์Œ์ ˆ ๊ทธ๋ฆฌ๊ณ  ์ธ์šฉ๋ถ€ํ˜ธ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ž์งˆ ๋กœ ์ด์šฉํ•˜์˜€์œผ๋ฉฐ, kNN ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ETRI, KAIST ์ฝ” ํผ์Šค์—์„œ ๊ฐ๊ฐ 96.73%, 98.64%์˜ ์ •ํ™•๋ฅ ์„ ๋ณด์˜€๋‹ค. ๋˜ ํ•œ ๋‘ ์ฝ”ํผ์Šค๋ฅผ ๋ชจ๋‘ ํ•™์Šตํ•œ ๊ฒฝ์šฐ์—๋Š” 98.82%์˜ ์ •ํ™• ๋ฅ ์„ ๋ณด์˜€๋‹ค. ์ด์™€ ๊ฐ™์ด ์˜์–ด์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๋Š” ๋งŽ์ด ์ด๋ฃจ์–ด์กŒ์œผ๋‚˜ ํ•œ๊ตญ์–ด์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๋Š” ์•„์ง ๋งŽ์ง€ ์•Š์€ ์‹ค์ •์ด๋ฉฐ, ๋˜ ํ•œ ๊ธฐ์กด์— ์—ฐ๊ตฌ๋œ ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„์ธ์‹์˜ ๋…ผ๋ฌธ์—์„œ๋„ ์–ธ๊ธ‰ํ•˜์˜€๋“ฏ์ด ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„์— ๋Œ€ํ•œ ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ ์ฆ˜์„ ํ†ตํ•œ ์‹คํ—˜์ด ํ•„์š”ํ•˜๋‹ค. ๋˜ํ•œ ๊ฐœ๋ณ„ ์–ธ์–ด์— ๋Œ€ํ•ด ์„œ๋งŒ ์‹คํ—˜ ๋ฐ ๊ฒ€์ฆ์ด ์ด๋ฃจ์–ด์ง€๊ณ , ํ•™์Šต์— ์‚ฌ์šฉ๋œ ์ž ์งˆ์ด ๋‹ค๋ฅธ ๋„๋ฉ”์ธ ๋˜๋Š” ๋‹ค๋ฅธ ๊ณ„ํ†ต ์–ธ์–ด์— ๋Œ€ํ•œ ์‹คํ—˜ ๋ฐ ๊ฒ€์ฆ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š์•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๊ฒ ๋‹ค. ํ•˜์ง€๋งŒ ์ตœ๊ทผ ์›น ์ž๋ฃŒ ๋ฐ ๋ฌธ์„œ์˜ ๊ฒฝ์šฐ ๋‹ค์–‘ํ•œ ์–ธ์–ด ์˜ ์ž๋ฃŒ๊ฐ€ ๋งŽ์ด ๋ฐœ์ƒํ•˜๋ฉฐ, ๊ทธ๋Ÿฌํ•œ ๋ฒ”์šฉ์ ์ธ ์ž์งˆ์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๊ณ  ์ƒ๊ฐ๋˜๋ฉฐ, ์ด์— ๋ฌธ์žฅ๊ฒฝ๊ณ„์— ํ•„์š”ํ•œ ๋‹ค์–‘ํ•œ ์ž์งˆ์„ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ ํ•™์Šต ํ•˜๊ณ  ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„ ๋ฟ๋งŒ์ด ์•„๋‹ˆ๋ผ ์˜์–ด์— ๋Œ€ํ•˜์—ฌ ๋„ ์‹คํ—˜์„ ํ•˜๊ณ ์ž ํ•œ๋‹ค. 3. ๋ฌธ์žฅ ๊ฒฝ๊ณ„์ธ์‹ ๊ฐ€. ์ฝ”ํผ์Šค ์ •์ œ ๋ฐ ๊ตฌ์กฐํ™” 1) ํ•œ๊ตญ์–ด์˜ ๊ฒฝ์šฐ ๋ฌธ์ž์—ด ์ธ์ฝ”๋”ฉ์„ EUC-KR ์—์„œ UTF-8 ์œผ๋กœ ๋ณ€ํ™˜ 2) ๊นจ์ง„ ๋ฌธ์ž์—ด ๋ฐ ์ค‘๋ณต๋œ ์˜ค๋ถ„์„ ๊ฒฐ๊ณผ ์ œ๊ฑฐ 3) ๋ฌธ์žฅ, ์–ด์ ˆ, ํ† ํฐ๋‹จ์œ„๋กœ ์ €์žฅ 4) ์ „์ฒด ํ•™์Šต ์ฝ”ํผ์Šค๋ฅผ ๊ตฌ์กฐํ™” ํ•˜์—ฌ ์ €์žฅ ๋‚˜. ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•œ ํ†ต๊ณ„์ •๋ณด ์ถ”์ถœ 1) ์ฝ”ํผ์Šค ๋‚ด์˜ ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„๋กœ ์ธ์‹๋œ ์œ„์น˜ ์˜ ๊ตฌ๋‘์ (ํŠน์ˆ˜๋ฌธ์ž ํฌํ•จ)์˜ ๋นˆ๋„์ˆ˜ 2) ๊ตฌ๋‘์  ํ›„๋ณด ์•ž/๋’ค์˜ ํ† ํฐ์œ ํ˜•(๋ณธ ๋…ผ๋ฌธ์—์„œ ๋ถ„๋ฅ˜ํ•œ ํ•œ๊ตญ์–ด์—์„œ ๋ฐœ์ƒํ•˜๋Š” 6 ๊ฐ€์ง€ ํ† ํฐ ์œ ํ˜•)์˜ ๋นˆ๋„์ˆ˜ ๋‹ค. ๋ฌธ์žฅ๊ฒฝ๊ณ„๋ฅผ ํŒ๋ณ„ํ•˜๊ธฐ ์œ„ํ•œ ํ†ต๊ณ„์ •๋ณด ์ถ”์ถœ 1) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์œ„์น˜์—์„œ ๊ตฌ๋‘์ ์˜ ์œ  ํ˜•์— ๋”ฐ๋ฅธ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜ 2) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์•ž/๋’ค ํ† ํฐ์— ๋Œ€ํ•˜์—ฌ ํ† ํฐ์œ ํ˜•์— ๋”ฐ๋ฅธ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜ 3) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์•ž/๋’ค 1 ์Œ์ ˆ์— ๋Œ€ํ•œ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜ 4) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์•ž/๋’ค ์–ด์ ˆ์— ๋Œ€ํ•œ ๋ฌธ ์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜ 5) ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด ์•ž/๋’ค ํ† ํฐ์— ๋Œ€ํ•œ ๋ฌธ ์žฅ๊ฒฝ๊ณ„ ์œ ๋ฌด ๋นˆ๋„์ˆ˜ 4. ํ†ต๊ณ„์  ์ž์งˆ ๊ฐ€. ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์œ„์น˜์˜ ์„ ํƒ ๋ชจ๋“  ๋ฌธ์žฅ๊ฒฝ๊ณ„์—์„œ ๋ฐœ์ƒํ•œ ์Œ์ ˆ ๋˜๋Š” ๋ฌธ์ž์— ๋Œ€ํ•œ ํ†ต๊ณ„์ •๋ณด๋ฅผ ํ†ตํ•˜์—ฌ, ๊ฐ€์žฅ ๋งŽ์ด ๋ฐœ์ƒํ•œ ์ƒ ์œ„ 99.9%์˜ ์Œ์ ˆ ๋˜๋Š” ๋ฌธ์ž๊ฐ€ ์ถœํ˜„ํ•˜๋Š” ์œ„์น˜๋ฅผ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋กœ ๋ณด์•˜์œผ๋ฉฐ, ํ•œ๊ตญ์–ด์˜ ๊ฒฝ์šฐ ์ด 5 ๊ฐ€์ง€์˜ ๊ตฌ๋‘์ (๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ, ๋Š๋‚Œํ‘œ, ํฐ/์ž‘ ์€๋”ฐ์˜ดํ‘œ)์ด ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋กœ ๋ณผ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์˜ ์–ด์˜ ๊ฒฝ์šฐ ์ด 3 ๊ฐ€์ง€์˜ ๊ตฌ๋‘์ (๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ, ํฐ ๋”ฐ์˜ดํ‘œ)์ด ํ›„๋ณด๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ํ•œ๊ตญ์–ด์˜ ๊ฒฝ์šฐ๊ฐ€ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด์ข…๋ฅ˜๊ฐ€ ๋” ๋งŽ ์œผ๋ฏ€๋กœ ํ•œ๊ตญ์–ด๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ํ‘œํ˜„ ํ•˜์˜€๋‹ค. ๋‚˜. ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋ฅผ ์œ„ํ•œ ๊ตฌ๋‘์  1) ๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ, ๋Š๋‚Œํ‘œ, ํฐ/์ž‘์€๋”ฐ์˜ดํ‘œ 2) ํ›„๋ณด๊ฐ€ ๋ฐœ์ƒํ•œ ๊ฒฝ์šฐ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ํ›„๋ณด๋กœ ์ธ์‹. ๋‚˜๋Š” ํ•™๊ต์— ๊ฐ‘๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ง‘์œผ๋กœ ์˜ต๋‹ˆ๋‹ค. โ†‘ โ†‘ ๋‹ค. ๊ตฌ๋‘์ ํ›„๋ณด ์•ž/๋’ค์˜ ํ† ํฐ์œ ํ˜• 1) ํ•œ๊ธ€/ ์™ธ๊ตญ์–ด/ ์ˆซ์ž/ ํŠน์ˆ˜๋ฌธ์ž 2) ์œ ํ˜• 1 (๋งˆ์นจํ‘œ, ๋ฌผ์Œํ‘œ, ๋Š๋‚Œํ‘œ) 3) ์œ ํ˜• 2 (์‰ผํ‘œ, ์ฝœ๋ก , ๋น—๊ธˆ) 4) ์œ ํ˜• 3 (๋”ฐ์˜ดํ‘œ, ๊ด„ํ˜ธ, ์ค„์ž„ํ‘œ, ๋ถ™์ž„ํ‘œ) 5) ์œ ํ˜• 4 (๊ธฐํƒ€ ํŠน์ˆ˜๋ฌธ์ž) 6) ์œ ํ˜• 5 (์œ ํ˜•์— ํฌํ•จ๋˜์ง€ ์•Š๋Š” ๋ฌธ์ž์—ด) ๋‹น์ผ ์ฃผ๊ฐ€๋Š” 13.3% ๋–จ์–ด์ง„ 9.96 ๋‹ฌ๋Ÿฌ๊ฐ€ ๋๋‹ค. ์ˆซ์ž/./๊ธฐํ˜ธ 4 ์ˆซ์ž/./์ˆซ์ž ํ•œ๊ธ€/. ๋ผ. ๊ตฌ๋‘์ ํ›„๋ณด์˜ ์•ž/๋’ค์˜ ์•ž/๋’ค์˜ 1 ์Œ์ ˆ 1) ํ†ต๊ณ„์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฐœ์ƒํ•  ํ™•๋ฅ ์„ ์ด์šฉ ์ƒํ’ˆ์„ ๋‚ด๋†“๋Š”๋‹ค. ๋‚ด๋‹ฌ๋ถ€ํ„ฐ๋Š” ๊ฐ€์กฑ๊ฐ„ ๋‹ค/./๋‚ด ๋งˆ. ๊ตฌ๋‘์ ํ›„๋ณด ์•ž/๋’ค์˜ ํ† ํฐ ๋ฌธ์ž์—ด 1) ํ†ต๊ณ„์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฐœ์ƒํ•  ํ™•๋ฅ ์„ ์ด์šฉ ์žก์ง€ ์•Š์•˜์œผ๋ฉด ํ•œ๋‹ค"๋ฉฐ "CNN ๊ณผ ํ•จ๊ป˜ ๋Œ€์„  ํ•œ๋‹ค/"/๋ฉฐ ๋ฉฐ/"/CNN ๋ฐ”. ๊ตฌ๋‘์ ํ›„๋ณด ์•ž/๋’ค์˜ ํ† ํฐ์˜ ๊ธธ์ด 1) ๊ธธ์ด๋Š” ์Œ์ ˆ์˜ ๊ฐœ์ˆ˜ 57.7%๋กœ ์ง€๋‚œ 2003 ๋…„ 12 ์›”(104.4%) ์ดํ›„ 2/./7 3/./1 ์‚ฌ. ๊ตฌ๋‘์ ํ›„๋ณด ์ด์ „/๋‹ค์Œ ๊ตฌ๋‘์  ํ›„๋ณด๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ 1) ๊ฑฐ๋ฆฌ๋Š” ์‚ฌ์ด์— ์กด์žฌํ•˜๋Š” ํ† ํฐ์˜ ์ˆ˜ 0.6 -2.5 ์™€ํŠธ์˜ TDP ๊ทœ๊ฒฉ์„ ๋”ฐ๋ฅด๋ฉฐ 1.8GHz ๊นŒ์ง€ 1/./3 3/./6 6/./3 5. ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ• ์„ ํƒ ๊ฐ€. ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์œผ๋กœ ์‹คํ—˜ 1) ์ถฉ๋ถ„ํžˆ ์ž‘์€ ํ•™์Šต์ง‘ํ•ฉ ์ƒ˜ํ”Œ๋ง ํ•ด๋‹น ์ฝ”ํผ์Šค์— ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ• ์„ ํƒ์„ ์œ„ํ•จ์ด๋ฏ€๋กœ, ๊ฒฐ๊ณผ์— ์™œ๊ณก์ด ๋ฐœ์ƒ ํ•˜์ง€ ์•Š์„ ์ •๋„์˜ ์ž‘์€ ์ง‘ํ•ฉ์„ ์ƒ˜ํ”Œ๋ง ํ•˜์˜€๋‹ค. 2) ์ถ”์ถœ๋œ ํ•™์Šต์ง‘ํ•ฉ์„ ํ†ตํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ
  • 3. ์ œ 29 ํšŒ ํ•œ๊ตญ์ •๋ณด์ฒ˜๋ฆฌํ•™ํšŒ ์ถ˜๊ณ„ํ•™์ˆ ๋ฐœํ‘œ๋Œ€ํšŒ ๋…ผ๋ฌธ์ง‘ ์ œ 15 ๊ถŒ ์ œ 1 ํ˜ธ (2008. 5) 3 ์ฆ˜์— ๋Œ€ํ•˜์—ฌ ์‹คํ—˜์„ ์ˆ˜ํ–‰ 3) ์‹คํ—˜ ๋Œ€์ƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘์—์„œ ์ƒ์œ„ 2 ๊ฐœ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ ํƒ 4) ์ถฉ๋ถ„ํžˆ ํฐ ํ•™์Šต์ง‘ํ•ฉ์— ๋Œ€ํ•˜์—ฌ ๋‹ค์‹œ ์‹คํ—˜ ์„ ํƒํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ „์ฒด ํ•™์Šต์ง‘ํ•ฉ์—์„œ๋„ ์ž˜ ๋™์ž‘ํ•˜๋Š” ์ง€๋ฅผ ์œ„ํ•œ ์‹คํ—˜์„ ๋‹ค์‹œ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค. ๋‚˜. 1 ์ฐจ ์‹คํ—˜๊ฒฐ๊ณผ ์šฐ์„  1 ์ฐจ ์‹คํ—˜์œผ๋กœ์จ ์ถฉ๋ถ„ํžˆ ์ž‘์€ ํ•™์Šต์ง‘ํ•ฉ์œผ ๋กœ ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ์‹คํ—˜ํ•ด ๋ณด๊ณ , ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚œ 2 ๊ฐœ์˜ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์œผ๋กœ 2 ์ฐจ ์‹คํ—˜์„ ํ•˜๊ณ ์ž ํ•œ๋‹ค. 1) ์ƒ˜ํ”Œ๋ง ํ•™์Šต์ง‘ํ•ฉ ํ•™์Šต ํ›„ ์‹คํ—˜๊ฒฐ๊ณผ Classifier Precision Recall f-measure BayesNet 0.962 0.917 0.939 NaiveBayes 0.954 0.878 0.915 Logistic 0.965 0.977 0.971 Multilayer Perceptron 0.957 0.97 0.964 kNN 0.954 0.963 0.958 AdaBoost 0.934 0.953 0.944 Bagging 0.966 0.984 0.975 Dagging 0.952 0.963 0.957 Decorate 0.967 0.966 0.966 LogitBoost 0.947 0.95 0.949 ADTree 0.942 0.964 0.953 J48(C4.5) 0.97 (1st ) 0.969 0.964 Random Forest 0.968 (2nd ) 0.981 0.974 REPTree 0.96 0.983 0.971 2) ๊ฒฐ๊ณผ๋ถ„์„ โ‘  ์ด 885 ๊ฐœ์˜ ์–ด์ ˆ (0.1%)์„ ํ•™์Šต โ‘ก ์ •ํ™•๋ฅ ๊ธฐ์ค€ ์ƒ์œ„ 2 ๊ฐœ์˜ Classifier ์„ ํƒํ•˜ ์˜€์œผ๋ฉฐ, ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์˜ ๊ฒฝ์šฐ ์žฌํ˜„์œจ ๋ณด๋‹ค๋Š” ์ •ํ™•๋ฅ ์ด ์ข€ ๋” ์˜๋ฏธ ์žˆ๋Š” ์ง€ํ‘œ ๋ผ๊ณ  ํŒ๋‹จํ•˜์˜€๋‹ค. ๋‹ค. 2 ์ฐจ ์‹คํ—˜๊ฒฐ๊ณผ 1) 30% ์ƒ˜ํ”Œ๋ง ํ•™์Šต์ง‘ํ•ฉ ํ•™์Šต ํ›„ ์‹คํ—˜๊ฒฐ๊ณผ Classifier Precision Recall f-measure Decision Tree C4.5 0.979 0.98 0.98 Random Forest 0.983 0.986 0.985 2) 70% ์ƒ˜ํ”Œ๋ง ํ•™์Šต์ง‘ํ•ฉ ํ•™์Šต ํ›„ ์‹คํ—˜๊ฒฐ๊ณผ Classifier Precision Recall f-measure Decision Tree C4.5 0.98 0.989 0.984 Random Forest 0.991 0.992 0.991 3) ๊ฒฐ๊ณผ๋ถ„์„ โ‘  265,529 ์–ด์ ˆ(30%), 619,567 ์–ด์ ˆ(70%) โ‘ก 1,000 ๊ฐœ๊ฐ€ ๋˜์ง€ ์•Š๋Š” ํ•™์Šต์ง‘ํ•ฉ๊ณผ ๊ทธ์˜ 1,000 ๋ฐฐ๊ฐ€ ๋„˜๋Š” ํ•™์Šต์ง‘ํ•ฉ๊ณผ ์ •ํ™•๋ฅ  , ์žฌํ˜„์œจ ๋ชจ๋‘ 3%์ •๋„ ๋ฐ–์— ํ–ฅ์ƒ์ด ์—† ์—ˆ๋‹ค. ๋ผ. ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ• ์„ ํƒ 1) ์‹œ์Šคํ…œ ์„ฑ๋Šฅ์ƒ Random Forest ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ๋ชจ๋“  ํ•™์Šต์ง‘ํ•ฉ์„ ํ•™์Šต ์‹œํ‚ฌ ์ˆ˜๋Š” ์—†์—ˆ์œผ๋ฉฐ, 70%์ •๋„ (์•ฝ 60 ๋งŒ ์–ด์ ˆ)์˜ ํ•™์Šต์„ ํ†ตํ•˜์—ฌ ๊ฐ€์žฅ ๋‚ณ์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ํŒ๋‹จ๋จ 6. ์‹คํ—˜๊ฒฐ๊ณผ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•œ๊ธ€์˜ ๊ฒฝ์šฐ ์„ธ์ข…์ฝ”ํผ์Šค 100 ๋งŒ ์–ด์ ˆ ์ค‘ 70 ๋งŒ ์–ด์ ˆ์„ ์‚ฌ์šฉํ•˜์—ฌ 99.1%์˜ ์ •ํ™•๋ฅ  ์„ ์–ป์–ด๋‚ผ ์ˆ˜ ์žˆ์—ˆ์œผ๋ฉฐ, Spoken ๊ณผ Written ์„ ๋น„์œจ ์„ ๋™์ผํ•˜๊ฒŒ ํ•˜์—ฌ ์‹คํ—˜์„ ์‹ค์‹œํ•˜์˜€์œผ๋ฉฐ, ํ•™์Šต ๋ฐ ์‹คํ—˜์€ 10-fold cross validation ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์˜๋ฌธ ์ฝ”ํผ์Šค์ธ Wall Street Journal ์˜ ๊ฒฝ์šฐ๋Š” ์ด 110 ๋งŒ ์–ด์ ˆ ๋ชจ๋‘๋ฅผ ์‚ฌ์šฉํ•˜์˜€๊ณ  ๋™์ผํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ์‹คํ—˜์„ ์‹ค์‹œํ•˜์˜€๋‹ค. ํ‰๊ฐ€์ฒ™๋„๋กœ๋Š” ๋ฌธ์žฅ ์ •ํ™•๋ฅ (Precision), ๋ฌธ์žฅ ์žฌํ˜„ ์œจ(Recall) ๋ฐ F-measure ๋ฅผ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ์ฒ™๋„์˜ ์ •์˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. ์ˆ˜๋ฌธ์žฅ์ถ”์ถœํ•œ์‹œ์Šคํ…œ์ด ์ˆ˜๋ฌธ์žฅ์ •๋‹ต์ถ”์ถœํ•œ์‹œ์Šคํ…œ์ด Pr =ecision ์ˆ˜๋ฌธ์žฅ์ „์ฒด ์ˆ˜๋ฌธ์žฅ์ •๋‹ต์ถ”์ถœํ•œ์‹œ์Šคํ…œ์ด Re =call callecision callecision measureF RePr Re*Pr*2 + =โˆ’ ๊ฐ€. ์„ธ์ข…๊ณ„ํš ์ฝ”ํผ์Šค์— ๋Œ€ํ•œ ์‹คํ—˜ Spoken ๊ณผ Written ๋ชจ๋‘ ํ•™์Šต์— ํฌํ•จ์‹œ์ผฐ์œผ๋ฉฐ, ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚œ Decision Tree ์™€ Random Forest ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ ํƒํ•˜์—ฌ ์‹คํ—˜ํ•˜์˜€๋‹ค 1) ์‹คํ—˜๊ฒฐ๊ณผ Classifier Precision Recall f-measure Decision Tree 0.98 0.989 0.984 Random Forest 0.991 0.992 0.991 โ‘  Spoken ์ฝ”ํผ์Šค์™€ Written ์ฝ”ํผ์Šค์— ๋Œ€ํ•˜ ์—ฌ 90%์ •๋„๋ฅผ Random Sampling ํ•˜์—ฌ ํ•™ ์Šตํ•˜๊ณ  ๋‚˜๋จธ์ง€ 10%๋ฅผ ์ด์šฉํ•˜์—ฌ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์•˜๋‹ค. โ‘ก Classifier ์˜ ๊ฒฝ์šฐ Random Forest ๊ฐ€ ๋” ๋‚˜ ์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ๋‚˜. Wall Street Journal ์ฝ”ํผ์Šค์— ๋Œ€ํ•œ ์‹คํ—˜ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฒ€์ฆ์€ 10-Fold Cross Validation ์œผ๋กœ ํ•˜์˜€์œผ๋ฉฐ, ํ•œ๊ตญ์–ด์™€๋Š” ๋‹ฌ๋ฆฌ ์ž์งˆ์˜ ์ˆ˜ ๋ฐ ์ •๋ณด์˜ ํฌ๊ธฐ๊ฐ€ ๋‹ฌ๋ผ์„œ ๋ชจ๋“  ํ•™์Šต์ง‘ํ•ฉ์„ ๋‹ค ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ ์—ˆ๋‹ค 1) ์‹คํ—˜๊ฒฐ๊ณผ Classifier Precision Recall f-measure Decision Tree 0.989 0.946 0.967 Random Forest 0.981 0.942 0.961 2) ๊ฒฐ๊ณผ๋ถ„์„ โ‘  ์•„์ฃผ ํฐ ์ฐจ์ด๋Š” ์•„๋‹ˆ์ง€๋งŒ, ํ•œ๊ตญ์–ด์™€๋Š” ๋‹ฌ๋ฆฌ Decision Tree ๊ฐ€ ์กฐ๊ธˆ ๋” ์ข‹์€ ์„ฑ๋Šฅ ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. โ‘ก ๋™์ผํ•œ ์ž์งˆ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜์˜€์œผ ๋ฉฐ, ๋ณ„๋„์˜ Heuristic ์„ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒ„์œผ๋กœ์จ ํ•ด๋‹น ์ž์งˆ์ด ๋ฒ”์šฉ์ ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋‹ค. ์ž์งˆ์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์‹คํ—˜ ๊ฐ ์ž์งˆ์„ ์ œ๊ฑฐํ•˜๊ณ  ์‹คํ—˜ํ•˜์˜€์„ ๊ฒฝ์šฐ์— ํ•ด๋‹น ์ž์งˆ์ด ์–ผ๋งˆ๋‚˜ ์ „์ฒด ์ •ํ™•๋ฅ ์„ ๋†’์ด๋Š” ๋ฐ์— ๊ธฐ
  • 4. ์ œ 29 ํšŒ ํ•œ๊ตญ์ •๋ณด์ฒ˜๋ฆฌํ•™ํšŒ ์ถ˜๊ณ„ํ•™์ˆ ๋ฐœํ‘œ๋Œ€ํšŒ ๋…ผ๋ฌธ์ง‘ ์ œ 15 ๊ถŒ ์ œ 1 ํ˜ธ (2008. 5) 4 ์—ฌํ•˜๋Š” ์ง€๋ฅผ ํ…Œ์ŠคํŠธ ํ•ด๋ณด์•˜๋‹ค. ์ „์ฒด ์‹คํ—˜์ง‘ํ•ฉ ์—์„œ ํ…Œ์ŠคํŠธ๋Š” ํ•˜์ง€ ๋ชปํ•˜์˜€์œผ๋ฉฐ, ์ƒ˜ํ”Œ๋ง ํ•œ ๊ฒฐ ๊ณผ์—์„œ ํ•ด๋‹น ์ž์งˆ์— ๋Œ€ํ•œ ์ƒ๋Œ€์ ์ธ ๋น„๊ต๋ฅผ ์ˆ˜ ํ–‰ํ•˜์˜€๋‹ค. Removed Feature Precision Recall f-measure Complete Set (All Features) 0.975 0.982 0.978 Base Set (Punctuation) 0.92 (-5.5%) 0.936 (4.6%) 0.928 (-5%) Prefix-token-type 0.974 (-0.1%) 0.98 (-0.2%) 0.977 (-0.1%) Suffix-token-type 0.974 (-0.1%) 0.982 0.978 Prefix-syllable 0.975 0.982 0.978 Suffix-syllable 0.976 (+0.1%) 0.979 (-0.3%) 0.978 Prefix-token 0.978 (+0.3%) 0.979 (-0.3%) 0.978 Suffix-token 0.973 (-0.2%) 0.976 (-0.6%) 0.974 (-0.4%) Prefix-token- length 0.964 (-1.1%) 0.984 (+0.2%) 0.974 (-0.4%) Suffix-token- length 0.973 (-0.2%) 0.983 (+0.1%) 0.978 Prefix-token- distance 0.975 0.983 (+0.1%) 0.979 (+0.1%) Suffix-token- distance 0.975 0.979 (-0.3%) 0.977 (-0.1%) 1) ์„ ํƒ๋œ ์ž์งˆ์„ ์ œ๊ฑฐํ•˜๊ณ  ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ, ์•ž/ ๋’ค ํ† ํฐ์˜ ๊ธธ์ด ํ† ํฐ์˜ ์œ ํ˜•, ๊ตฌ๋‘์  ์ˆœ์œผ ๋กœ ๊ธ์ •์ ์ธ ์˜ํ–ฅ์„ ๋ณด์˜€๋‹ค. 2) ๋ฐ˜๋ฉด์— ๋’ค ์ฒซ ์Œ์ ˆ ์•ž์˜ ํ† ํฐ ๋“ฑ์˜ ์ž์งˆ์€ ์˜คํžˆ๋ ค ์ •ํ™•๋ฅ ์„ ๋–จ์–ด๋œจ๋ฆฌ๋Š” ํ˜„์ƒ์„ ๋ณด์˜€ ๋‹ค ๋ผ. ๊ธฐ์—ฌ๋„๊ฐ€ ๋–จ์–ด์ง€๋Š” ์ž์งˆ์„ ์ œ๊ฑฐํ•˜๊ณ  ์‹คํ—˜ Features Precision Recall f-measure 12 features 98% 98.9% 98.4% 10 features 98.5% 98.1% 98.4% 1) ์œ„์—์„œ ์ œ์‹œ๋œ ์ •ํ™•๋ฅ ์„ ๋–จ์–ด๋œจ๋ฆฌ๋Š” ์ž์งˆ ์„ ์ œ๊ฑฐํ•˜๊ณ  ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ ํ™•์‹คํžˆ ์ •ํ™•๋ฅ ์€ ์˜ฌ๋ผ๊ฐ”์œผ๋‚˜, f-measure ๋Š” ๋™์ผํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ฌ ์œผ๋ฉฐ, ์ „์ฒด์ ์œผ๋กœ ํ•ด๋‹น ์ž์งˆ์„ ์ œ๊ฑฐํ•˜์—ฌ ์ •ํ™•๋ฅ ์€ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์—ˆ์œผ๋‚˜, ์žฌํ˜„์œจ์€ ๋–จ์–ด์กŒ๋‹ค. ๋งˆ. ํ•™์Šต๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ๋ณ€ํ™˜ 1) ํ•™์Šต ๋ฐ์ดํ„ฐ๊ฐ€ 1,000 ๊ฑด ์ •๋„์— ๊ฑฐ์˜ 97% ์ด์ƒ์˜ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉฐ, ๊ทธ ์ด์ƒ์˜ ๋ฐ์ดํ„ฐ ์—์„œ ์„ ํ˜•์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜์˜€๋‹ค 2) ๊ทธ๋ž˜ํ”„ ์ƒ์œผ๋กœ๋Š” ์ž˜ ๋ณด์ด์ง€ ์•Š์ง€๋งŒ, ์ผ๋ถ€ ์„ฑ๋Šฅ์ด ํŠ€๋Š” ํ˜„์ƒ๋„ ๋ณด์˜€๋‹ค 7. ๊ฒฐ๋ก  ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€์œผ๋ฉฐ, Decision Tree ์™€ Random Forest ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ ์‹คํ—˜ ๋ฐ ๋น„๊ต ๋ฅผ ํ•˜์˜€๋‹ค. ๋˜ํ•œ ์–ธ์–ด์— ์ข…์†์ ์ด์ง€ ์•Š์€ ๋…๋ฆฝ์ ์ธ ์ž์งˆ๋งŒ์„ ์‚ฌ์šฉํ•˜๋ ค๊ณ  ๋…ธ๋ ฅํ•˜์˜€์œผ๋ฉฐ, ๊ธฐ์กด์˜ ํ•œ๊ตญ์–ด ๋ฌธ์žฅ๊ฒฝ๊ณ„ ์ธ์‹์˜ ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ ํŠน์ • ๊ตฌ๋‘์  ์ • ๋ณด์— ์˜์กดํ•˜์ง€ ์•Š๊ณ ์„œ๋„ ๋ณด๋‹ค ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๊ณ  ์žˆ ์œผ๋ฉฐ, ๊ตฌ๋‘์ ์˜ ๋ฐœ์ƒํšŸ์ˆ˜์— ๋”ฐ๋ฅธ ํŽธ์ฐจ ๋“ฑ์˜ ํ˜„์ƒ์ด ์—†๊ณ , Spoken Language ์™€ Written Language ์— ๋”ฐ๋ฅธ ์ • ํ™•๋ฅ ์ด ๋–จ์–ด์ง€๋Š” ๋ฌธ์ œ๋„ ๋งŽ์ด ๋ณด์ •๋˜์—ˆ๋‹ค. ์˜๋ฌธ ์ฝ”ํผ์Šค์—๋„ ๋™์ผํ•˜๊ฒŒ ์ ์šฉํ•˜์—ฌ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ ๋†’์€ ์ •ํ™•๋ฅ ์„ ๋ณด์ž„์œผ๋กœ์จ ์„ ํƒํ•œ ์ž์งˆ์ด ๋ณด๋‹ค ๋ฒ”์šฉ ์ ์ธ ๊ฒƒ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๊ณ  ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ต ํ•œ ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ํ–ฅํ›„ ์—ฐ๊ตฌ๋กœ๋Š” ์ •ํ˜•ํ™” ๋˜์–ด์žˆ์ง€ ์•Š์€ ๋ฌธ์„œ ์ฆ‰, ์›น ๋ฌธ์„œ๋‚˜ ๊ตฌ๋‘์ ์ด ์ƒ๋žต๋œ ๋ฌธ์„œ ๋“ฑ์˜ ๊ฒฝ์šฐ์—๋„ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋™์ž‘ํ•˜๋Š” ์ง€์— ๋Œ€ํ•ด์„œ๋„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ์ž์งˆ์— ๋Œ€ํ•œ ๊ฒƒ๋“ค๋„ ๊ณ ๋ คํ•ด๋ณด๊ณ ์ž ํ•œ๋‹ค. ์ฐธ๊ณ ๋ฌธํ—Œ [1] 1989, Some applications of tree-based modeling to speech and language [2] 1994, Adaptive Sentence Boundary Disambiguation - Palmer, Hearst [3] 1997, A Maximum Entropy Approach to Identifying Sentence Boundaries - Reynar, Ratnaparkhi [4] 1997, Adaptive Multilingual Sentence Boundary Disambiguation - Palmer, Hearst [5] 2000, Reducing parsing complexity by intra-sentence segmentation based on maximum entropy model [6] 2000, Tagging sentence boundaries [7] 2004, Dependency structure analysis and sentence boundary detection in spontaneous Japanese [8] 2004, Korean Sentence Boundary Detection Using Memory-based Machine Learning - Lim, Han [9] 2005, Instance-based sentence boundary determination by optimization for natural language generation [10] 2006, Unsupervised Multilingual Sentence Boundary Detection