Playing with Natural Language
― in a friendly way
ML/DM Monday
2013/01/28
Who Am I
✓ 蔡家琦 (Tsai, Chia-Chi)

✓ ID: Rueshyna (Rues)
✓ Work: Ruby on Rails

✓ Machine Learning & Text Mining
What is NLP?
Natural language processing
(自然語言處理)
Processing human language:
ML/DM + linguistics
Levels of analysis
semantics (語意)
grammar (文法)
text (文字)
There are problems to solve at every level.
Goal: make computers understand human language.
What is NLTK?
Natural Language ToolKit
A Python library
for tackling NLP problems quickly
Python 2 or 3?
Python 3: no stable NLTK release yet — there is a development version, still a work in progress...
Python 2: the stable release
Open resources
Book: http://nltk.org/book/
Official site: http://nltk.org/
Install:
pip install pyyaml nltk
Natural Language
Topics
Sentence segmentation (斷句)

Today many historians think that only about twenty percent
of the colonists supported Britain. Some colonists supported
whichever side seemed to be winning.
                                                    -- from VOA

Splitting on "." works here:

• Today many historians think that only about twenty
percent of the colonists supported Britain.

• Some colonists supported whichever side seemed to be
winning.
Sentence segmentation (斷句)

Iowa-based political committee in 2007 and has grown larger
since taking a leading role now against Mr. Hagel. “Postelection
we have new battle lines being drawn with the president; he
kicks it off with these nominations and it made sense for us.”

                                        -- from New York Times

Splitting on "." fails here:

• Iowa-based political committee in 2007 and has grown larger since taking a
leading role now against Mr.

• Hagel.

• “Postelection we have new battle lines being drawn with the president; he
kicks it off with these nominations and it made sense for us.

• ”
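The failure above is why NLTK ships a trained sentence tokenizer (punkt) instead of a period rule. A minimal sketch of the idea — keep a list of known abbreviations and refuse to split after them — is shown below; the abbreviation list is a tiny illustrative set, not what punkt actually learns (punkt discovers abbreviations from the corpus without supervision):

```python
import re

# Toy abbreviation list for illustration only; punkt learns these from data.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Nov.", "U.S."}

def naive_split(text):
    """Split after '.', '?', '!' — but merge back after a known abbreviation."""
    parts = re.split(r'(?<=[.?!])\s+', text)
    sents = []
    for part in parts:
        if sents and sents[-1].split()[-1] in ABBREVIATIONS:
            sents[-1] += ' ' + part   # "Mr." did not end a sentence
        else:
            sents.append(part)
    return sents

text = ("He has grown larger since taking a leading role now "
        "against Mr. Hagel. It made sense for us.")
print(naive_split(text))
```

With the abbreviation check, "Mr. Hagel." stays one sentence; without it, the split produces the broken "Hagel." fragment shown on the slide.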
Tokenization (分詞)

Today is a beautiful day

Splitting on whitespace:
[Today] [is] [a] [beautiful] [day]

But what about these?
beautiful day.
“Who knows?”
$50
industry’s
for youths;
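The tricky cases above (trailing punctuation, quoted text, $50, the possessive ’s) are why word tokenization is more than a whitespace split. A small regex tokenizer in the spirit of nltk.word_tokenize — this pattern is a simplified sketch, not NLTK's actual rules — handles them like this:

```python
import re

# Keep $50 together, peel off trailing punctuation, split clitics like 's.
TOKEN_RE = re.compile(r"""
    \$?\d+(?:\.\d+)?   # numbers, optionally with $ and decimals
  | \w+(?='s\b)        # word stem before a possessive 's
  | 's\b               # the possessive clitic itself
  | \w+                # ordinary words
  | [^\w\s]            # any single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Today is a beautiful day."))
print(tokenize("the industry's $50 fee; \u201cWho knows?\u201d"))
```

Whitespace splitting would glue "day." and "knows?”" into single tokens; the regex keeps words, numbers, and punctuation apart.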
Part of speech (POS, 詞性)

Pierre Vinken , 61 years old , will join the
board as a nonexecutive director Nov. 29 .

Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD
join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ
director/NN Nov./NNP 29/CD ./.

(Penn Treebank tag set)
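The tags in the example come from the Penn Treebank tag set. A quick reference for the ones used above — this is an illustrative subset of the tag set, not the complete list:

```python
# Meanings of the Penn Treebank tags appearing in the slide's example.
PENN_TAGS = {
    'NNP': 'proper noun, singular',
    'CD':  'cardinal number',
    'NNS': 'noun, plural',
    'JJ':  'adjective',
    'MD':  'modal',
    'VB':  'verb, base form',
    'DT':  'determiner',
    'NN':  'noun, singular or mass',
    'IN':  'preposition or subordinating conjunction',
}

tagged = [('Pierre', 'NNP'), ('61', 'CD'), ('will', 'MD'),
          ('join', 'VB'), ('the', 'DT'), ('board', 'NN')]

for word, tag in tagged:
    print('%-8s %-4s %s' % (word, tag, PENN_TAGS.get(tag, '?')))
```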
Parse tree (剖析樹)

W.R. Grace holds three of Grace Energy's seven board seats .

[The slides build up the parse tree for this sentence step by step; the tree figure is not reproduced in this text export.]
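Treebank parses are stored as bracketed strings, and NLTK's Tree class reads that format. A minimal reader turning such a string into nested lists conveys the idea; the bracketed parse below is a hypothetical, heavily simplified fragment for the slide's sentence, not the actual Treebank annotation:

```python
def read_tree(s):
    """Parse a bracketed tree string like '(S (NP ...) (VP ...))' into nested lists."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()

    def parse(i):
        assert tokens[i] == '('
        node = [tokens[i + 1]]          # node label, e.g. 'S' or 'NP'
        i += 2
        while tokens[i] != ')':
            if tokens[i] == '(':
                child, i = parse(i)     # recurse into a subtree
                node.append(child)
            else:
                node.append(tokens[i])  # a leaf word
                i += 1
        return node, i + 1

    tree, _ = parse(0)
    return tree

s = "(S (NP (NNP W.R.) (NNP Grace)) (VP (VBZ holds) (NP (CD three))))"
tree = read_tree(s)
print(tree[0])   # root label: S
```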
Corpus
 ✓ Treebank
   ✓ plain text, POS tags, parse trees


 ✓ Brown
   ✓ plain text, POS tags


 ✓ Reuters (路透社)
  ✓ plain text
Demo
Demo1
#!/usr/bin/env python
# Python 2 / NLTK 2-era code, as presented in 2013

import nltk
from urllib import urlopen

url = "http://www.voanews.com/articleprintview/1587223.html"
html = urlopen(url).read()
raw = nltk.clean_html(html)                  # strip HTML tags

# nltk.download('punkt')
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sents = sent_tokenizer.tokenize(raw)         # sentence segmentation

token = nltk.word_tokenize(sents[1])         # tokenization

# nltk.download('maxent_treebank_pos_tagger')
pos = nltk.pos_tag(token)                    # POS tagging
Demo2
#!/usr/bin/env python
# nltk.download('treebank')

import nltk
from nltk.corpus import treebank
from nltk.grammar import ContextFreeGrammar, Nonterminal
from nltk.parse import ChartParser

# collect the productions used in the first Treebank parse trees
productions = set(
  production for sent in treebank.parsed_sents()[0:9]
  for production in sent.productions())

# build a context-free grammar from those productions
grammar = ContextFreeGrammar(Nonterminal('S'), productions)

# create a parser from the grammar
parser = ChartParser(grammar)

parsed_tree = parser.parse(treebank.sents()[0])
# print parsed_tree
Demo3
#!/usr/bin/env python
# nltk.download('reuters')

import nltk
from nltk.probability import FreqDist
from nltk.probability import ConditionalFreqDist
from nltk.corpus import reuters
from nltk.corpus import brown

# word frequency counts
fd = FreqDist(map(lambda w: w.lower(), brown.words()[0:50]))
# fd.tabulate(10)
# fd.plot()

# frequency counts under different conditions (conditions and events)
cdf = ConditionalFreqDist((corpus, word)
  for corpus in ['reuters', 'brown']
  for word in eval(corpus).words()
  if word in map(str, range(1900, 1950, 5)))
# cdf.conditions()
# cdf['brown']['1910']
# cdf.tabulate()
# cdf.plot()
Thanks!!
