Playing with Natural Language ― in a friendly way

https://www.youtube.com/watch?v=4kTqXtJjiuQ

  1. Playing with Natural Language ― in a friendly way (ML/DM Monday, 2013/01/28)
  2. Who am I? ✓ 蔡家琦 (Tsai, Chia-Chi) ✓ ID: Rueshyna (Rues) ✓ Work: Ruby on Rails ✓ Machine Learning & Text Mining
  3. What is NLP?
  4. Natural language processing
  5. Natural language processing (自然語言處理)
  6. Natural language processing (自然語言處理): human language
  7. ML/DM + linguistics
  8. Levels of analysis
  9. Semantics, grammar, and the text of the problem
  10. Making the computer understand human language
  11. What is NLTK?
  12. Natural Language Toolkit
  13. Natural Language Toolkit: a Python library
  14. Natural Language Toolkit: a Python library for tackling NLP problems quickly
  15. Python 2 or 3?
  16. Python 3: no stable NLTK release yet
  17. ...but a development version is being worked on
  18. Python 2: the stable release
  19. OPEN (the NLTK book, free online): http://nltk.org/book/
  20. Official site: http://nltk.org/
  21. Installation: pip install pyyaml nltk (a data-download sketch follows below)
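      After the pip install, the corpora and tokenizer/tagger models used in the demos later in the deck still have to be fetched with NLTK's downloader; a minimal sketch, with package names taken from the demo comments below:

      import nltk

      # Fetch the models and corpora the later demos rely on.
      for pkg in ('punkt',                          # sentence tokenizer model
                  'maxent_treebank_pos_tagger',     # default POS tagger model
                  'treebank', 'brown', 'reuters'):  # corpora
          nltk.download(pkg)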
  22. Natural Language topics
  23. Sentence segmentation:
      Today many historians think that only about twenty percent of the colonists supported Britain. Some colonists supported whichever side seemed to be winning. -- from VOA
  24. Sentence segmentation, splitting on "." works here:
      • Today many historians think that only about twenty percent of the colonists supported Britain.
      • Some colonists supported whichever side seemed to be winning.
  25. Sentence segmentation:
      Iowa-based political committee in 2007 and has grown larger since taking a leading role now against Mr. Hagel. “Postelection we have new battle lines being drawn with the president; he kicks it off with these nominations and it made sense for us.” -- from New York Times
  26. Sentence segmentation, splitting on "." goes wrong (see the punkt sketch below):
      • Iowa-based political committee in 2007 and has grown larger since taking a leading role now against Mr.
      • Hagel.
      • “Postelection we have new battle lines being drawn with the president; he kicks it off with these nominations and it made sense for us.
      • ”
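      NLTK's pretrained punkt model handles exactly this case; a minimal sketch on the New York Times excerpt above, assuming the punkt data has been downloaded:

      import nltk

      text = ('Iowa-based political committee in 2007 and has grown larger '
              'since taking a leading role now against Mr. Hagel. '
              '"Postelection we have new battle lines being drawn with the '
              'president; he kicks it off with these nominations and it made '
              'sense for us."')

      # punkt treats "Mr." as an abbreviation, so it does not end a sentence there.
      for sent in nltk.sent_tokenize(text):
          print(sent)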
  27. Tokenization: Today is a beautiful day
  28. Tokenization, splitting on whitespace: [Today] [is] [a] [beautiful] [day]
  29. Tokenization, but whitespace alone breaks down on: beautiful day.  “Who knows?”  $50  industry’s  for youths;
  30. Tokenization: periods, currency amounts, question marks, quotes, and clitics each need their own treatment: beautiful day.  $50  ?  for youths;  “Who knows?”  industry’s (see the word_tokenize sketch below)
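      A minimal sketch contrasting plain whitespace splitting with NLTK's word_tokenize on the strings from slide 29:

      import nltk

      samples = ['beautiful day.', '"Who knows?"', '$50', "industry's", 'for youths;']

      for s in samples:
          print(s.split())              # whitespace only: punctuation stays attached
          print(nltk.word_tokenize(s))  # splits off punctuation, quotes, "$", and clitics like 's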
  31. Part-of-speech tagging (POS): Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
  32. Part-of-speech tagging (POS), with Penn Treebank tags:
      Pierre/NNP Vinken/NNP 61/CD years/NNS ,/, old/JJ will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
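      A minimal sketch of producing these tags with NLTK's default tagger (the maxent Treebank model downloaded in Demo 1 below):

      import nltk

      sentence = ('Pierre Vinken , 61 years old , will join the board '
                  'as a nonexecutive director Nov. 29 .')

      # pos_tag returns (token, Penn Treebank tag) pairs such as ('Pierre', 'NNP').
      print(nltk.pos_tag(sentence.split()))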
  33. Parse tree (parsed tree): W.R. Grace holds three of Grace Energy's seven board seats .
  34.–37. [parse-tree figures for the same sentence; the trees were not captured in this transcript]
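      Gold-standard trees like these can also be read straight out of the Treebank sample that ships with NLTK; a minimal sketch, assuming the treebank data has been downloaded (index 0 is the "Pierre Vinken" sentence from slides 31-32 rather than the Grace example):

      from nltk.corpus import treebank

      # Each parsed sentence is an nltk.Tree; printing it shows the bracketed
      # constituency structure (S, NP, VP, ...).
      tree = treebank.parsed_sents()[0]
      print(tree)
      # tree.draw()  # opens a window with the graphical tree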
  38. Corpus ✓ Treebank: plain text, POS tags, parse trees ✓ Brown: plain text, POS tags, parse trees ✓ Reuters (路透社): plain text
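      A minimal sketch of reading these corpora through NLTK's corpus readers, assuming the treebank, brown, and reuters data packages have been downloaded:

      from nltk.corpus import treebank, brown, reuters

      print(treebank.words()[:10])         # plain text (tokens)
      print(treebank.tagged_words()[:5])   # POS-tagged tokens
      print(treebank.parsed_sents()[0])    # parse trees

      print(brown.words()[:10])            # plain text
      print(brown.tagged_words()[:5])      # POS-tagged tokens

      print(reuters.words()[:10])          # plain text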
  39. Demo
  40. Demo 1: fetch a VOA article, strip the HTML, split it into sentences, tokenize, and POS-tag.
      #!/usr/bin/env python
      import nltk
      from urllib import urlopen

      url = "http://www.voanews.com/articleprintview/1587223.html"
      html = urlopen(url).read()
      raw = nltk.clean_html(html)

      # nltk.download('punkt')
      sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
      sents = sent_tokenizer.tokenize(raw)
      token = nltk.word_tokenize(sents[1])

      # nltk.download('maxent_treebank_pos_tagger')
      pos = nltk.pos_tag(token)
  41. Demo 1 (continued): nltk.clean_html(html) removes the HTML tags
  42. Demo 1 (continued): sent_tokenizer.tokenize(raw) performs sentence segmentation
  43. Demo 1 (continued): nltk.word_tokenize(sents[1]) performs tokenization
  44. Demo 1 (continued): nltk.pos_tag(token) performs POS tagging
  45. Demo 2: induce a context-free grammar from Treebank parse trees and parse a sentence with it.
      #!/usr/bin/env python
      # nltk.download('treebank')
      import nltk
      from nltk.corpus import treebank
      from nltk.grammar import ContextFreeGrammar, Nonterminal
      from nltk.parse import ChartParser

      productions = set(
          production
          for sent in treebank.parsed_sents()[0:9]
          for production in sent.productions())

      grammar = ContextFreeGrammar(Nonterminal('S'), productions)
      parser = ChartParser(grammar)
      parsed_tree = parser.parse(treebank.sents()[0])
      # print parsed_tree
  46. Demo 2 (continued): sent.productions() collects the Treebank productions
  47. Demo 2 (continued): ContextFreeGrammar(...) builds the grammar
  48. Demo 2 (continued): ChartParser(grammar) creates the parser
  49. Demo 3: frequency distributions over the Brown and Reuters corpora.
      #!/usr/bin/env python
      # nltk.download('reuters')
      import nltk
      from nltk.probability import FreqDist
      from nltk.probability import ConditionalFreqDist
      from nltk.corpus import reuters
      from nltk.corpus import brown

      fd = FreqDist(map(lambda w: w.lower(), brown.words()[0:50]))
      # fd.tabulate(10)
      # fd.plot()

      cdf = ConditionalFreqDist((corpus, word)
                                for corpus in ['reuters', 'brown']
                                for word in eval(corpus).words()
                                if word in map(str, range(1900, 1950, 5)))
      # cdf.conditions()
      # cdf['brown']['1910']
      # cdf.tabulate()
      # cdf.plot()
  50. Demo 3 (continued): FreqDist gives word-frequency counts
  51. Demo 3 (continued): ConditionalFreqDist gives word-frequency counts under different conditions (conditions and events)
  52. Thanks!!
