
Transcript

  • 1. 1
  • 2. • Java • Python NLTK • elif 2
  • 3. • Python Natural Language Processing : NLP - • Python 3
  • 4. • MeCab • NLTK • Hadoop (Hadoop Streaming) 4
  • 5. Sample scripts: • mecab.py: morphological analysis with MeCab • markov.py: Markov-chain text generation • freq.py: n-gram frequency counting • map.py / reduce.py: freq.py rewritten as a mapper and reducer for Hadoop • (bmc_sample.py): 2-gram sample using the Baidu corpus http://www.baidu.jp/corpus/ 5
  • 6. MeCab • http://mecab.sourceforge.net/ • Open-source Japanese morphological analyzer • Also used by Spotlight on Mac OS X • Written in C++, with bindings for Python/Perl/Ruby/Java 6
  • 7. Sample output of the mecab command: $ mecab prints one line per morpheme, giving the surface form followed by comma-separated features (part of speech, conjugation, base form, reading), with EOS marking the end of the sentence (the Japanese example text is omitted) 7
  • 8. mecab.py
    import sys
    import MeCab
    import nltk

    if __name__ == "__main__":
        file = sys.argv[1]
        raw = open(file).read()              # read the input file
        m = MeCab.Tagger("-Ochasen")         # MeCab tagger (ChaSen-style output)
        node = m.parseToNode(raw)            # split the text into morphemes
        node = node.next                     # skip the BOS (beginning-of-sentence) node
        while node:
            print node.surface, node.feature # print surface form and feature string
            node = node.next
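    Later slides (markovgen on slide 11, map.py on slide 22) expect a plain list of words, which the deck never shows being built. As a hedged sketch under that assumption, a small helper like the hypothetical parse() below could collect the surface forms from parseToNode into such a list; the name and behavior are illustrative, not from the original slides.

    import MeCab

    def parse(text):
        """Return the list of surface forms MeCab finds in text (illustrative helper)."""
        tagger = MeCab.Tagger("-Ochasen")
        node = tagger.parseToNode(text)
        words = []
        while node:
            if node.surface:          # skip the empty BOS/EOS nodes
                words.append(node.surface)
            node = node.next
        return words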
  • 9. NLTK (Natural Language Toolkit) • A natural language processing library for Python • http://www.nltk.org/ 9
  • 10. • ... • n-gram 10
  • 11. markov.py: trigram (3-gram) Markov-style text generation
    import nltk

    def markovgen(words, length):
        text = nltk.Text(words)   # wrap the token list in an NLTK Text object
        text.generate(length)     # build a trigram model and print `length` generated words
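    A hedged usage sketch for markovgen, assuming the parse() helper sketched after slide 8; the input file and the length of 100 words are illustrative, not from the deck.

    import sys

    if __name__ == "__main__":
        raw = open(sys.argv[1]).read()   # hypothetical input text (e.g. a blog dump)
        words = parse(raw)               # MeCab tokenizer sketched after slide 8
        markovgen(words, 100)            # print roughly 100 generated words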
  • 12. Example of text generated by markovgen (Japanese output; only the fragments "100" and "Bye Bye Guitar" survived transcription) 12
  • 13. 13
  • 14. Markov chains • Wikipedia: http://ja.wikipedia.org/wiki/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%81%8E%E7%A8%8B • Example of a Markov-chain bot: http://gigazine.net/index.php?/news/comments/20090709_markov_chain/ 14
  • 15. freq.py: n-gram frequency counting
    def ngrams(words, ngram, limit):
        grams = nltk.ngrams(words, ngram)     # build the n-gram tuples
        fd = nltk.FreqDist(grams)             # count how often each n-gram occurs
        result = {}
        for f in fd:                          # flatten the FreqDist into a dictionary
            r = ""
            for n in range(0, ngram):
                if n > 0:
                    r += " "
                r += f[n]
            result[r] = fd[f]
        c = 0                                 # number of lines printed so far
        for k, v in sorted(result.items(), key=lambda x: x[1], reverse=True):
            c += 1
            if limit > 0:
                if c > limit:
                    break
            print k + "\t" + str(result[k])   # n-gram and its count, tab-separated
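    A hedged usage sketch for the function above, again assuming the parse() helper from the earlier sketch; the input file and the choice of 2-grams with a top-10 limit are illustrative.

    import sys

    if __name__ == "__main__":
        raw = open(sys.argv[1]).read()   # hypothetical input text file
        words = parse(raw)               # MeCab tokenizer sketched after slide 8
        ngrams(words, 2, 10)             # print the 10 most frequent 2-grams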
  • 16. 16
  • 17. N-gram indexing for full-text search • Wikipedia: http://ja.wikipedia.org/wiki/%E5%85%A8%E6%96%87%E6%A4%9C%E7%B4%A2#N-Gram • The text is split into overlapping sequences of N characters and those sequences are indexed 17
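    To make the splitting step concrete, here is a minimal sketch (not from the slides; the helper name is made up) of turning a string into overlapping character 2-grams, the units such an index stores.

    def char_ngrams(text, n=2):
        """Return the overlapping character n-grams of text."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    # char_ngrams("slideshare") -> ['sl', 'li', 'id', 'de', 'es', 'sh', 'ha', 'ar', 're']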
  • 18. bmc_sample.py: top 10 2-gram entries from the Baidu corpus $ python bmc_sample.py text/2gram.txt prints one entry per line, a score followed by L:/R:-tagged fields whose Japanese tokens were lost in transcription 18
  • 19. • _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _2038_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ - 19
  • 20. Hadoop • http://hadoop.apache.org/ • Open-source framework for distributed processing • With Hadoop Streaming, the Map/Reduce steps can be written in Python 20
  • 21. Running the job with Hadoop Streaming
    $ hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-0.20.x-streaming.jar \
        -mapper "python map.py" \
        -reducer "python reduce.py" \
        -input /python/input/ \
        -output /python/output
    -mapper: the Map command / -reducer: the Reduce command / -input: input path on HDFS / -output: output path on HDFS
    ※ Other streaming and JobConf options: http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Streaming+Options+and+Usage
  • 22. map.py
    import sys
    import nltk

    if __name__ == "__main__":
        for line in sys.stdin:                # read input lines from standard input
            line = line.strip()
            words = parse(line)               # tokenize with MeCab (parse() not shown on the slide)
            fd = nltk.FreqDist(words)         # word frequencies for this line
            for f in fd:
                print f + "\t" + str(fd[f])   # emit tab-separated Key/Value pairs
  • 23. reduce.py
    import sys
    from operator import itemgetter

    if __name__ == "__main__":
        word2count = {}
        for line in sys.stdin:                     # read Key/Value pairs from the mappers
            line = line.strip()
            word, count = line.split('\t', 1)
            try:
                count = int(count)
                word2count[word] = word2count.get(word, 0) + count   # sum the counts per word
            except ValueError:
                pass                               # skip lines whose count is not a number
        sorted_word2count = sorted(word2count.items(), key=itemgetter(0))   # sort by word
        for word, count in sorted_word2count:
            print '%s\t%s' % (word, count)         # emit the aggregated Key/Value pairs
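    Before submitting to the cluster, a streaming mapper/reducer pair like this is commonly sanity-checked with a local shell pipeline; the command below is such a check (input.txt is a placeholder file name, not from the slides).

    $ cat input.txt | python map.py | sort | python reduce.py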
  • 24. • Natural language processing in Python with NLTK • 24