6. PROCESSING RAW TEXT
WITH NLTK
(http://www.nltk.org/book/)
After extracting the text from an HTML document on the web, I extract keywords from the text with NLTK.
7. EXAMPLES
"A 'multicultural Korean society' that turns its back on migrant children"
(http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea)
[('', 65), ('(', 9), (')', 9), ('한다', 6), ("'", 6), ('있다', 5),
('아동', 5), ('큰', 5), ('모든', 5), ('일', 5), ('국제', 4),
('대한민국', 4), ('나라', 4), ('땅', 4), ('국제사회', 4),
('인권', 4), ('의원', 3), ('세계', 3), ('여의', 3), ('수', 3),
('안', 3), ('강한', 3), ('불문', 2), ('이주', 2), ('법무부', 2)]
8. 1. HTML TO RAW TEXT
# -*- coding: utf-8 -*-
from urllib import request
import nltk, re, pprint
from nltk import word_tokenize
from bs4 import BeautifulSoup
url = 'http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea'
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html, 'html.parser').get_text()
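The get_text() step above can also be sketched with the standard library alone; a minimal stand-in using html.parser, in case BeautifulSoup is not installed (this is an illustration, not the slides' actual code):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data and ignores tags - a rough stand-in for
    BeautifulSoup's get_text(). Unlike BeautifulSoup, it does not skip
    <script> or <style> content."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)

p = TextExtractor()
p.feed('<html><body><p>이주 아동</p></body></html>')
print(p.text())  # 이주 아동
```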
10. 2. RAW TEXT TO LIST
raw = raw[30123:32364]
print(type(raw))
-> <class 'str'>
tokens = word_tokenize(raw)
print(type(tokens))
-> <class 'list'>
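The (word, count) pairs on the EXAMPLES slide look like the output of a frequency count over these tokens. A minimal sketch with collections.Counter, whose most_common() matches the interface of nltk.FreqDist (the sample tokens here are illustrative, not the article's):

```python
from collections import Counter

# Sample tokens standing in for the tokenized article text.
tokens = ['아동', '인권', '아동', '국제', '아동']

# Counter/FreqDist both return (word, count) pairs sorted by frequency.
freq = Counter(tokens)
print(freq.most_common(2))  # [('아동', 3), ('인권', 1)]
```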
11. 3. LIST TO VOCABULARIES
words = Trial.NounExtractor(token)
13. 3. LIST TO VOCABULARIES
token = ['철수는', '동생에게', '자전거를', '빌려주었다']
words = Trial.NounExtractor(token)
words = ['철수', '동생', '자전거', '빌려주었다']
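The implementation of Trial.NounExtractor is not shown in these slides; a hypothetical sketch of what such an extractor might do is to strip common Korean case particles from the end of each token. The particle list and the suffix rule below are assumptions for illustration only, reproducing the example above:

```python
# Hypothetical particle list - NOT the real Trial.NounExtractor logic.
PARTICLES = ('는', '은', '이', '가', '을', '를', '에게', '에서')

def noun_extract(tokens):
    """Strip a trailing case particle from each token, longest match first.
    Tokens without a listed particle (e.g. verbs) pass through unchanged."""
    out = []
    for t in tokens:
        for p in sorted(PARTICLES, key=len, reverse=True):
            if t.endswith(p) and len(t) > len(p):
                t = t[:-len(p)]
                break
        out.append(t)
    return out

token = ['철수는', '동생에게', '자전거를', '빌려주었다']
print(noun_extract(token))  # ['철수', '동생', '자전거', '빌려주었다']
```

A real Korean morphological analyzer (e.g. the taggers in the KoNLPy package) would handle irregular particles and conjugation far more robustly than this suffix rule.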