Playing with Natural Language
― in a friendly way
ML/DM Monday
2013/01/28
Who Am I
✓ 蔡家琦 (Tsai, Chia-Chi)

✓ ID: Rueshyna (Rues)
✓ Work: Ruby on Rails

✓ Machine Learning & Text Mining
What is NLP?
Natural language processing
(自然語言處理)
Processing human language:
ML/DM + linguistics
Levels of analysis
semantics (語意)
grammar (文法)
text (文字)
There are problems to solve at every level.
Goal: make computers understand human language.
What is NLTK?
Natural Language ToolKit
A Python library
for tackling NLP problems quickly
Python 2 or 3?
Python 3: no stable NLTK release yet — there is a development version, still a work in progress...
Python 2: the stable release
Open resources
Book: http://nltk.org/book/
Official site: http://nltk.org/
Install:
pip install pyyaml nltk
Natural Language
Topics
Sentence segmentation (斷句)

Today many historians think that only about twenty percent
of the colonists supported Britain. Some colonists supported
whichever side seemed to be winning.
                                                    -- from VOA

Splitting on "." works here:

• Today many historians think that only about twenty
percent of the colonists supported Britain.

• Some colonists supported whichever side seemed to be
winning.
Sentence segmentation (斷句)

Iowa-based political committee in 2007 and has grown larger
since taking a leading role now against Mr. Hagel. “Postelection
we have new battle lines being drawn with the president; he
kicks it off with these nominations and it made sense for us.”

                                        -- from New York Times

Splitting on "." fails here:

• Iowa-based political committee in 2007 and has grown larger since taking a
leading role now against Mr.

• Hagel.

• “Postelection we have new battle lines being drawn with the president; he
kicks it off with these nominations and it made sense for us.

• ”
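The failure above is why NLTK ships a trained sentence tokenizer (punkt) instead of a period rule. A minimal sketch of the idea — keep a list of known abbreviations and refuse to split after them — is shown below; the abbreviation list is a tiny illustrative set, not what punkt actually learns (punkt discovers abbreviations from the corpus without supervision):

```python
import re

# Toy abbreviation list for illustration only; punkt learns these from data.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Nov.", "U.S."}

def naive_split(text):
    """Split after '.', '?', '!' — but merge back after a known abbreviation."""
    parts = re.split(r'(?<=[.?!])\s+', text)
    sents = []
    for part in parts:
        if sents and sents[-1].split()[-1] in ABBREVIATIONS:
            sents[-1] += ' ' + part   # "Mr." did not end a sentence
        else:
            sents.append(part)
    return sents

text = ("He has grown larger since taking a leading role now "
        "against Mr. Hagel. It made sense for us.")
print(naive_split(text))
```

With the abbreviation check, "Mr. Hagel." stays one sentence; without it, the split produces the broken "Hagel." fragment shown on the slide.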
Tokenization (分詞)

Today is a beautiful day

Splitting on whitespace:
[Today] [is] [a] [beautiful] [day]

But what about these?
beautiful day.
“Who knows?”
$50
industry’s
for youths;
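The tricky cases above (trailing punctuation, quoted text, $50, the possessive ’s) are why word tokenization is more than a whitespace split. A small regex tokenizer in the spirit of nltk.word_tokenize — this pattern is a simplified sketch, not NLTK's actual rules — handles them like this:

```python
import re

# Keep $50 together, peel off trailing punctuation, split clitics like 's.
TOKEN_RE = re.compile(r"""
    \$?\d+(?:\.\d+)?   # numbers, optionally with $ and decimals
  | \w+(?='s\b)        # word stem before a possessive 's
  | 's\b               # the possessive clitic itself
  | \w+                # ordinary words
  | [^\w\s]            # any single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Today is a beautiful day."))
print(tokenize("the industry's $50 fee; \u201cWho knows?\u201d"))
```

Whitespace splitting would glue "day." and "knows?”" into single tokens; the regex keeps words, numbers, and punctuation apart.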
Part of speech (POS, 詞性)

Pierre Vinken , 61 years old , will join the
board as a nonexecutive director Nov. 29 .

Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD
join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ
director/NN Nov./NNP 29/CD ./.

(Penn Treebank tag set)
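The tags in the example come from the Penn Treebank tag set. A quick reference for the ones used above — this is an illustrative subset of the tag set, not the complete list:

```python
# Meanings of the Penn Treebank tags appearing in the slide's example.
PENN_TAGS = {
    'NNP': 'proper noun, singular',
    'CD':  'cardinal number',
    'NNS': 'noun, plural',
    'JJ':  'adjective',
    'MD':  'modal',
    'VB':  'verb, base form',
    'DT':  'determiner',
    'NN':  'noun, singular or mass',
    'IN':  'preposition or subordinating conjunction',
}

tagged = [('Pierre', 'NNP'), ('61', 'CD'), ('will', 'MD'),
          ('join', 'VB'), ('the', 'DT'), ('board', 'NN')]

for word, tag in tagged:
    print('%-8s %-4s %s' % (word, tag, PENN_TAGS.get(tag, '?')))
```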
Parse tree (剖析樹)

W.R. Grace holds three of Grace Energy's seven board seats .

[The slides build up the parse tree for this sentence step by step; the tree figure is not reproduced in this text export.]
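Treebank parses are stored as bracketed strings, and NLTK's Tree class reads that format. A minimal reader turning such a string into nested lists conveys the idea; the bracketed parse below is a hypothetical, heavily simplified fragment for the slide's sentence, not the actual Treebank annotation:

```python
def read_tree(s):
    """Parse a bracketed tree string like '(S (NP ...) (VP ...))' into nested lists."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()

    def parse(i):
        assert tokens[i] == '('
        node = [tokens[i + 1]]          # node label, e.g. 'S' or 'NP'
        i += 2
        while tokens[i] != ')':
            if tokens[i] == '(':
                child, i = parse(i)     # recurse into a subtree
                node.append(child)
            else:
                node.append(tokens[i])  # a leaf word
                i += 1
        return node, i + 1

    tree, _ = parse(0)
    return tree

s = "(S (NP (NNP W.R.) (NNP Grace)) (VP (VBZ holds) (NP (CD three))))"
tree = read_tree(s)
print(tree[0])   # root label: S
```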
Corpus
 ✓ Treebank
   ✓ plain text, POS tags, parse trees


 ✓ Brown
   ✓ plain text, POS tags


 ✓ Reuters (路透社)
  ✓ plain text
Demo
Demo1
#!/usr/bin/env python
# Python 2 / NLTK 2-era code, as presented in 2013

import nltk
from urllib import urlopen

url = "http://www.voanews.com/articleprintview/1587223.html"
html = urlopen(url).read()
raw = nltk.clean_html(html)                  # strip HTML tags

# nltk.download('punkt')
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sents = sent_tokenizer.tokenize(raw)         # sentence segmentation

token = nltk.word_tokenize(sents[1])         # tokenization

# nltk.download('maxent_treebank_pos_tagger')
pos = nltk.pos_tag(token)                    # POS tagging
Demo2
#!/usr/bin/env python
# nltk.download('treebank')

import nltk
from nltk.corpus import treebank
from nltk.grammar import ContextFreeGrammar, Nonterminal
from nltk.parse import ChartParser

# collect the productions used in the first Treebank parse trees
productions = set(
  production for sent in treebank.parsed_sents()[0:9]
  for production in sent.productions())

# build a context-free grammar from those productions
grammar = ContextFreeGrammar(Nonterminal('S'), productions)

# create a parser from the grammar
parser = ChartParser(grammar)

parsed_tree = parser.parse(treebank.sents()[0])
# print parsed_tree
Demo3
#!/usr/bin/env python
# nltk.download('reuters')

import nltk
from nltk.probability import FreqDist
from nltk.probability import ConditionalFreqDist
from nltk.corpus import reuters
from nltk.corpus import brown

# word frequency counts
fd = FreqDist(map(lambda w: w.lower(), brown.words()[0:50]))
# fd.tabulate(10)
# fd.plot()

# frequency counts under different conditions (conditions and events)
cdf = ConditionalFreqDist((corpus, word)
  for corpus in ['reuters', 'brown']
  for word in eval(corpus).words()
  if word in map(str, range(1900, 1950, 5)))
# cdf.conditions()
# cdf['brown']['1910']
# cdf.tabulate()
# cdf.plot()
Thanks!!
