Playing with Natural Language ― in a friendly way

Video: https://www.youtube.com/watch?v=4kTqXtJjiuQ
Transcript

  • 1. Playing with Natural Language ― in a friendly way. ML/DM Monday, 2013/01/28
  • 2. Who am I? ✓ 蔡家琦 (Tsai, Chia-Chi) ✓ ID: Rueshyna (Rues) ✓ Work: Ruby on Rails ✓ Machine Learning & Text Mining
  • 3. What is NLP?
  • 4. Natural language processing
  • 5. Natural language processing (自然語言處理)
  • 6. Natural language processing (自然語言處理): human language
  • 7. ML/DM + linguistics
  • 8. Levels of analysis
  • 9. Semantics, grammar, problem, text
  • 10. Making computers understand human language
  • 11. What is NLTK?
  • 12. Natural Language ToolKit
  • 13. Natural Language ToolKit: a Python library
  • 14. Natural Language ToolKit: a Python library for tackling NLP problems quickly
  • 15. Python 2 or 3?
  • 16. Python 3: no official release yet
  • 17. But a development version is in the works...
  • 18. The official release is for Python 2
  • 19. OPEN: http://nltk.org/book/
  • 20. Official site: http://nltk.org/
  • 21. Installation: pip install pyyaml nltk
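Besides the pip install, the corpora and models used in the later demos still have to be fetched with NLTK's downloader. A minimal setup sketch; the package names are the ones referenced in the demo code further down:
        import nltk

        # fetch the data packages the demos below rely on
        for pkg in ['punkt',                       # Punkt sentence tokenizer
                    'maxent_treebank_pos_tagger',  # default POS tagger model
                    'treebank', 'brown', 'reuters']:  # corpora used in the demos
            nltk.download(pkg)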
  • 22. Natural language topics
  • 23. Sentence segmentation: Today many historians think that only about twenty percent of the colonists supported Britain. Some colonists supported whichever side seemed to be winning. -- from VOA
  • 24. Splitting on ".":
        • Today many historians think that only about twenty percent of the colonists supported Britain.
        • Some colonists supported whichever side seemed to be winning.
  • 25. Sentence segmentation: Iowa-based political committee in 2007 and has grown larger since taking a leading role now against Mr. Hagel. “Postelection we have new battle lines being drawn with the president; he kicks it off with these nominations and it made sense for us.” -- from New York Times
  • 26. Splitting on "." breaks at the abbreviation and the closing quote:
        • Iowa-based political committee in 2007 and has grown larger since taking a leading role now against Mr.
        • Hagel.
        • “Postelection we have new battle lines being drawn with the president; he kicks it off with these nominations and it made sense for us.
        • ”
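A minimal sketch of how the trained Punkt sentence tokenizer (the one loaded in Demo1 below) handles this excerpt instead of splitting on every ".":
        import nltk

        text = ('Iowa-based political committee in 2007 and has grown larger '
                'since taking a leading role now against Mr. Hagel. "Postelection '
                'we have new battle lines being drawn with the president; he '
                'kicks it off with these nominations and it made sense for us."')

        # nltk.download('punkt')
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        for sent in sent_tokenizer.tokenize(text):
            print sent  # "Mr. Hagel" should stay inside a single sentence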
  • 27. Tokenization: Today is a beautiful day
  • 28. Splitting on whitespace: [Today] [is] [a] [beautiful] [day]
  • 29. Harder cases: beautiful day.   “Who knows?”   $50   industry’s   for youths;
  • 30. The tricky pieces: beautiful day.   $50   ?   for youths;   “Who knows?”   industry’s
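A minimal sketch contrasting plain whitespace splitting with nltk.word_tokenize (the function used in Demo1 below) on the harder string:
        import nltk

        s = 'beautiful day. "Who knows?" $50 industry\'s for youths;'
        print s.split()              # whitespace only: punctuation stays attached to the words
        print nltk.word_tokenize(s)  # tokenizer splits off punctuation, "$", and the clitic 's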
  • 31. Part-of-speech (POS) tagging: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
  • 32. Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./. (Penn Treebank tag set)
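A minimal sketch of tagging this sentence with nltk.pos_tag, as Demo1 below does (the tagger model must be downloaded first):
        import nltk

        # nltk.download('maxent_treebank_pos_tagger')
        sent = ('Pierre Vinken , 61 years old , will join the board '
                'as a nonexecutive director Nov. 29 .')
        print nltk.pos_tag(sent.split())  # list of (word, Penn Treebank tag) pairs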
  • 33-37. Parse tree: W.R. Grace holds three of Grace Energy's seven board seats .
        (These five slides show the parse tree for this sentence; the tree figures themselves are not preserved in the text transcript.)
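Since the tree figures are lost here, a minimal sketch of how a Treebank parse tree can be inspected in NLTK, using the treebank corpus reader that Demo2 below relies on (whether this exact sentence is in NLTK's Treebank sample is not guaranteed, so the sketch simply takes the first parsed sentence):
        from nltk.corpus import treebank

        # nltk.download('treebank')
        tree = treebank.parsed_sents()[0]  # first parsed sentence in the sample
        print tree                         # bracketed tree: (S (NP-SBJ ...) ...)
        tree.draw()                        # opens a window with the drawn tree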
  • 38. Corpora ✓ Treebank: plain text, POS tags, parse trees ✓ Brown: plain text, POS tags, parse trees ✓ Reuters: plain text
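A small sketch of how the corpora on this slide are reached through nltk.corpus with the standard corpus reader methods (the data packages must be downloaded first):
        from nltk.corpus import treebank, brown, reuters

        print treebank.words()[:10]        # plain text (word list)
        print treebank.tagged_words()[:5]  # (word, POS tag) pairs
        print treebank.parsed_sents()[0]   # parse tree
        print brown.tagged_words()[:5]     # Brown is POS-tagged as well
        print reuters.words()[:10]         # Reuters: plain text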
  • 39. Demo
  • 40. Demo1
        #!/usr/bin/env python
        import nltk
        from urllib import urlopen

        url = "http://www.voanews.com/articleprintview/1587223.html"
        html = urlopen(url).read()
        raw = nltk.clean_html(html)            # strip HTML tags

        # nltk.download('punkt')
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sents = sent_tokenizer.tokenize(raw)   # sentence segmentation
        token = nltk.word_tokenize(sents[1])   # tokenization

        # nltk.download('maxent_treebank_pos_tagger')
        pos = nltk.pos_tag(token)              # POS tagging
  • 41-44. (The same code, repeated with the four comments above highlighted one at a time.)
  • 45. Demo2
        #!/usr/bin/env python
        # nltk.download('treebank')
        import nltk
        from nltk.corpus import treebank
        from nltk.grammar import ContextFreeGrammar, Nonterminal
        from nltk.parse import ChartParser

        # collect the Treebank productions from the first few parsed sentences
        productions = set(
            production
            for sent in treebank.parsed_sents()[0:9]
            for production in sent.productions())

        # build a context-free grammar from those productions
        grammar = ContextFreeGrammar(Nonterminal('S'), productions)
        parser = ChartParser(grammar)          # create the parser
        parsed_tree = parser.parse(treebank.sents()[0])
        # print parsed_tree
  • 46-48. (The same code, repeated with the comments above highlighted one at a time.)
  • 49. Demo3
        #!/usr/bin/env python
        # nltk.download('reuters')
        import nltk
        from nltk.probability import FreqDist
        from nltk.probability import ConditionalFreqDist
        from nltk.corpus import reuters
        from nltk.corpus import brown

        # word frequency counts
        fd = FreqDist(map(lambda w: w.lower(), brown.words()[0:50]))
        # fd.tabulate(10)
        # fd.plot()

        # frequency counts under different conditions (conditions and events)
        cdf = ConditionalFreqDist(
            (corpus, word)
            for corpus in ['reuters', 'brown']
            for word in eval(corpus).words()
            if word in map(str, range(1900, 1950, 5)))
        # cdf.conditions()
        # cdf['brown']['1910']
        # cdf.tabulate()
        # cdf.plot()
  • 50-51. (The same code, repeated with the two comments above highlighted one at a time.)
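As a follow-up to Demo3, the frequency distribution can also be queried directly. A minimal sketch; the counts depend on the 50-word Brown slice used above:
        from nltk.probability import FreqDist
        from nltk.corpus import brown

        fd = FreqDist(w.lower() for w in brown.words()[0:50])  # same slice as Demo3
        print fd.N()      # total number of tokens counted
        print fd.max()    # the most frequent token in the slice
        print fd['the']   # how many times a particular word occurred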
  • 52. Thanks!!
