Playing with Natural Language ― in a friendly way
https://www.youtube.com/watch?v=4kTqXtJjiuQ

CC Attribution-NonCommercial License

Playing with Natural Language ― in a friendly way Presentation Transcript

  • 1. Playing with Natural Language ― in a friendly way. ML/DM Monday, 2013/01/28
  • 2. Who am I? ✓ 蔡家琦 (Tsai, Chia-Chi) ✓ ID: Rueshyna (Rues) ✓ Work: Ruby on Rails ✓ Machine Learning & Text Mining
  • 3. What is NLP?
  • 4-6. Natural language processing (in Chinese: 自然語言處理); natural language = human language
  • 7. ML/DM + linguistics
  • 8. Levels of analysis
  • 9. Semantics; grammar; problem; text
  • 10. Getting computers to understand human language
  • 11. What is NLTK?
  • 12-14. Natural Language ToolKit: a Python library for handling NLP problems quickly
  • 15. Python 2 or 3?
  • 16. Python 3: no stable release yet
  • 17. But a development version is in the works...
  • 18. Python 2: stable release
  • 19. OPEN http://nltk.org/book/
  • 20. Official site: http://nltk.org/
  • 21. Installation: pip install pyyaml nltk
  • 22. Natural language topics
  • 23-24. Sentence segmentation
    Today many historians think that only about twenty percent of the colonists supported Britain. Some colonists supported whichever side seemed to be winning. -- from VOA
    Splitting on “.” gives:
    • Today many historians think that only about twenty percent of the colonists supported Britain.
    • Some colonists supported whichever side seemed to be winning.
  • 25-26. Sentence segmentation
    Iowa-based political committee in 2007 and has grown larger since taking a leading role now against Mr. Hagel. “Postelection we have new battle lines being drawn with the president; he kicks it off with these nominations and it made sense for us.” -- from New York Times
    Splitting on “.” gives:
    • Iowa-based political committee in 2007 and has grown larger since taking a leading role now against Mr.
    • Hagel.
    • “Postelection we have new battle lines being drawn with the president; he kicks it off with these nominations and it made sense for us.
    • ”
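
The “Mr.” failure above can be reproduced in a few lines. This is a minimal sketch in plain Python 3, not the Punkt tokenizer used in the demos; the function names and the small ABBREVIATIONS set are hypothetical and hand-picked for illustration only.

```python
# Naive split on "." versus a split that skips a few known abbreviations.
# ABBREVIATIONS is a hand-picked, hypothetical list, not a real resource.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Nov."}

def naive_split(text):
    """Cut after every period, as on the slide."""
    parts = [p.strip() for p in text.split(".")]
    return [p + "." for p in parts if p]

def abbreviation_aware_split(text):
    """Cut after a word ending in '.' unless it is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith(".") and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without a final period
        sentences.append(" ".join(current))
    return sentences

text = "He took a leading role against Mr. Hagel. It made sense for us."
print(naive_split(text))               # three pieces, broken after "Mr."
print(abbreviation_aware_split(text))  # two sentences
```

A fixed abbreviation list does not scale, which is presumably why the demos load a trained sentence tokenizer instead.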
  • 27-30. Tokenization
    Today is a beautiful day
    Splitting on whitespace gives: [Today] [is] [a] [beautiful] [day]
    Cases where punctuation or symbols stay attached: [beautiful day.] [“Who knows?”] [$50] [industry’s] [for youths;]
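
To see why whitespace alone is not enough, here is a small Python 3 sketch that compares it with a regular-expression tokenizer. TOKEN_PATTERN is a hypothetical pattern written for this example; it is not what nltk.word_tokenize does internally.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace only."""
    return text.split()

# Hypothetical pattern: a currency amount, or a word with an optional
# apostrophe part, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\$?\d+(?:\.\d+)?|\w+(?:['’]\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Separate punctuation from words with a regular expression."""
    return TOKEN_PATTERN.findall(text)

sample = "Who knows? $50 for the industry's youths;"
print(whitespace_tokenize(sample))  # "knows?" and "youths;" keep their punctuation
print(regex_tokenize(sample))       # "?" and ";" become separate tokens
```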
  • 31-32. Part-of-speech (POS) tagging
    Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
    Penn Treebank tags: Pierre/NNP Vinken/NNP 61/CD years/NNS ,/, old/JJ will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
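
The word/TAG notation above is easy to load into a program. The sketch below (Python 3, standard library only) uses a hypothetical helper, parse_tagged; NLTK ships a similar helper, nltk.tag.str2tuple, for single items.

```python
def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")  # split on the last slash
        pairs.append((word, tag))
    return pairs

tagged = "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB"
pairs = parse_tagged(tagged)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(pairs[:3])
print(nouns)
```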
  • 33-37. Parse tree
    W.R. Grace holds three of Grace Energy's seven board seats .
    [Figure: parse tree diagrams, not included in the transcript]
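
Since the tree figures are missing, here is one way to picture a parse tree in code. This is a sketch in Python 3 using nested tuples; the tree is hand-written and simplified to the first few words, so it is an assumption for illustration and not the actual Treebank annotation of the sentence.

```python
# A hand-written, simplified tree. The labels follow Penn Treebank style,
# but this is NOT the exact Treebank annotation of the sentence.
tree = ("S",
        ("NP", ("NNP", "W.R."), ("NNP", "Grace")),
        ("VP", ("VBZ", "holds"),
               ("NP", ("CD", "three"))),
        (".", "."))

def leaves(node):
    """Return the words at the bottom of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def bracketed(node):
    """Render the tree in bracketed notation: (LABEL child child ...)."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [bracketed(c) for c in node[1:]]) + ")"

print(leaves(tree))
print(bracketed(tree))
```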
  • 38. Corpus ✓ Treebank: plain text, POS tags, parse trees ✓ Brown: plain text, POS tags, parse trees ✓ Reuters: plain text
  • 39. Demo
  • 40-44. Demo 1 (slides 41-44 repeat this code; each highlights one step, noted here as a trailing comment)
        #!/usr/bin/env python
        # Python 2 code, as presented on the slide
        import nltk
        from urllib import urlopen

        url = "http://www.voanews.com/articleprintview/1587223.html"
        html = urlopen(url).read()
        raw = nltk.clean_html(html)              # strip the HTML tags (slide 41)

        #nltk.download('punkt')
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sents = sent_tokenizer.tokenize(raw)     # sentence segmentation (slide 42)
        token = nltk.word_tokenize(sents[1])     # tokenization (slide 43)

        #nltk.download('maxent_treebank_pos_tagger')
        pos = nltk.pos_tag(token)                # POS tagging (slide 44)
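
The tag-stripping step of Demo 1 can also be sketched with the Python 3 standard library alone. This may be useful because, as far as I know, nltk.clean_html is no longer implemented in NLTK 3. The class and function names below (TextExtractor, strip_html) are hypothetical, and the sketch is not a reimplementation of clean_html.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_html(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = ("<html><head><style>p {color: red}</style></head>"
        "<body><p>Today is a <b>beautiful</b> day.</p></body></html>")
print(strip_html(page))
```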
  • 45-48. Demo 2 (slides 46-48 repeat this code; each highlights one step, noted here as a comment)
        #!/usr/bin/env python
        # Python 2 code, as presented on the slide
        #nltk.download('treebank')
        import nltk
        from nltk.corpus import treebank
        from nltk.grammar import ContextFreeGrammar, Nonterminal
        from nltk.parse import ChartParser

        # collect the productions from the treebank (slide 46)
        productions = set(
            production
            for sent in treebank.parsed_sents()[0:9]
            for production in sent.productions())

        # encode the grammar (slide 47)
        grammar = ContextFreeGrammar(Nonterminal('S'), productions)

        # create the parser (slide 48)
        parser = ChartParser(grammar)
        parsed_tree = parser.parse(treebank.sents()[0])
        # print parsed_tree
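
What "collecting productions" means in Demo 2 can be shown without any corpus. In this Python 3 sketch the two nested-tuple trees are made up and merely stand in for treebank.parsed_sents()[0:9]. The productions function is a hypothetical helper, and unlike NLTK it does not distinguish nonterminals from words (both are plain strings here).

```python
def productions(node):
    """Yield (lhs, rhs) rules for every internal node of a nested-tuple tree."""
    if isinstance(node, str):
        return
    label, children = node[0], node[1:]
    # The right-hand side is the child labels, or the word itself for a leaf.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        yield from productions(child)

# Two tiny hand-made trees standing in for parsed treebank sentences.
trees = [
    ("S", ("NP", ("NNP", "Grace")),
          ("VP", ("VBZ", "holds"), ("NP", ("NNS", "seats")))),
    ("S", ("NP", ("NNP", "Vinken")),
          ("VP", ("VBZ", "holds"), ("NP", ("NNS", "seats")))),
]

# A set removes the rules that both trees share.
grammar_rules = set(rule for tree in trees for rule in productions(tree))
for lhs, rhs in sorted(grammar_rules):
    print(lhs, "->", " ".join(rhs))
```

Because the grammar only contains rules seen in those sentences, a parser built from it can only handle sentences that those rules cover.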
  • 49-51. Demo 3 (slides 50-51 repeat this code; each highlights one step, noted here as a comment)
        #!/usr/bin/env python
        # Python 2 code, as presented on the slide
        #nltk.download('reuters')
        import nltk
        from nltk.probability import FreqDist
        from nltk.probability import ConditionalFreqDist
        from nltk.corpus import reuters
        from nltk.corpus import brown

        # word frequency counts (slide 50)
        fd = FreqDist(map(lambda w : w.lower(), brown.words()[0:50]))
        #fd.tabulate(10)
        #fd.plot()

        # frequency counts under different conditions: Conditions and Events (slide 51)
        cdf = ConditionalFreqDist(
            (corpus, word)
            for corpus in ['reuters', 'brown']
            for word in eval(corpus).words()
            if word in map(str, range(1900, 1950, 5)))
        #cdf.conditions()
        #cdf['brown']['1910']
        #cdf.tabulate()
        #cdf.plot()
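
The idea behind FreqDist and ConditionalFreqDist can be approximated with the Python 3 standard library. In the sketch below the two word lists are made up and only stand in for the Brown and Reuters corpora; Counter and defaultdict are analogues chosen for illustration, not the NLTK classes themselves.

```python
from collections import Counter, defaultdict

# Tiny made-up "corpora"; the real demo reads the Brown and Reuters corpora.
corpora = {
    "brown": "In 1910 the town grew . By 1925 the Town had doubled .".split(),
    "reuters": "Output fell in 1925 and again in 1925 .".split(),
}

# Plain frequency distribution over lowercased words (cf. FreqDist).
fd = Counter(word.lower() for word in corpora["brown"])

# One counter per condition (cf. ConditionalFreqDist): the condition is the
# corpus name, the events are the year tokens 1900, 1905, ..., 1945.
years = set(str(y) for y in range(1900, 1950, 5))
cfd = defaultdict(Counter)
for name, words in corpora.items():
    for word in words:
        if word in years:
            cfd[name][word] += 1

print(fd["the"], fd["town"])
print(cfd["brown"]["1910"], cfd["reuters"]["1925"])
```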
  • 52. Thanks!!