Text Classification in 200 Lines

Presented at nltkstudy

Published in: Technology

  1. Text Classification in 200 Lines: Python x Twitter x MeCab x nltk. Kauli, @m1m0r1. July 26, 2011
  2. id: t_mimori, @m1m0r1. Work at Kauli (web). Python
  3. Python, Twitter: https://gist.github.com/1105157
  4.
  5. 1. 2. 3. 4. 5.
  6. 1. 2. 3. 4. 5.
  7. .1 Twitter API
     - http://code.google.com/p/python-twitter/ (pip install python-twitter)
     - Twitter API: https://dev.twitter.com/ (create an app)
     - API: http://dev.twitter.com/apps/ (Access Token)
  8. .2 python-twitter

         import twitter

         # Credentials issued for your app at dev.twitter.com (values omitted)
         CONSUMER_KEY = ''
         CONSUMER_SECRET = ''
         ACCESS_TOKEN_KEY = ''
         ACCESS_TOKEN_SECRET = ''

         def get_twitter_api():
             api = twitter.Api(consumer_key=CONSUMER_KEY,
                               consumer_secret=CONSUMER_SECRET,
                               access_token_key=ACCESS_TOKEN_KEY,
                               access_token_secret=ACCESS_TOKEN_SECRET)
             #api.VerifyCredentials()
             return api

         def get_timeline(id, count=100, page=1):
             # get user status
             api = get_twitter_api()
             statuses = api.GetUserTimeline(id=id, count=count, page=page)
             return [status.text for status in statuses]
  9. .3 get_timeline("utadahikaru")
     Yahoo! (@u3music) → http://t.co/4cVGdK1
  10. .4 API: 350/h, so get_timeline results are cached (※ TokyoCabinet, tweets.tch)

          class Twitter(object):
              def __init__(self, datafile='tweets.tch'):
                  import tc
                  self.db = tc.HDB(datafile, tc.HDBOWRITER | tc.HDBOCREAT)

              def close(self):
                  self.db.close()

              def get_timeline(self, id, count=100, update=False):
                  import json
                  if (id not in self.db) or update:
                      print("Fetching %s's timeline.." % id)
                      try:
                          timeline = get_timeline(id, count=count)
                          # get_timeline returns a list of tweet texts
                          print("%d tweets." % len(timeline))
                          self.db[id] = json.dumps(timeline)
                      except twitter.TwitterError as e:
                          print(e)
                          # cache an empty result so a failing id is not refetched
                          self.db[id] = json.dumps([])
                  return json.loads(self.db[id])
  11. 1. 2. 3. 4. 5.
  12. html: extractcontent. twitter: strip @mention and http://bit.ly/.. URLs

          import re

          re_mention = re.compile(r'@\w+')
          re_url = re.compile(r'http://[\w./?=&#+-]+')

          def cleanup_tweet(tweet):
              # drop @mentions, then URLs
              tweet = re_mention.sub('', tweet)
              tweet = re_url.sub('', tweet)
              return tweet

      Example tweet: Yahoo! (@u3music) → http://t.co/4cVGdK1
  13. 1. 2. 3. 4. 5.
  14. .1 MeCab (on Mac: brew install mecab). mecab-python: http://sourceforge.net/projects/mecab/files/
  15. .2 tagger = MeCabTagger(); tagger.text2words(u" Yahoo! ()") → Yahoo

          class MeCabTagger(object):
              def __init__(self):
                  import MeCab
                  self._tagger = MeCab.Tagger()

              def get_word(self, surface, feature):
                  # feature is MeCab's comma-separated string; with the IPA
                  # dictionary, fs[0] is the part of speech and fs[6] the base form
                  fs = feature.split(",")
                  # The part-of-speech strings compared here are missing from the transcript.
                  if fs[0] == u'' and fs[1] not in (u'', u''):
                      # prefer the base form; '*' means MeCab has none
                      return fs[6] if fs[6] != '*' else surface

              def text2words(self, text, kinds=None):
                  try:
                      node = self._tagger.parseToNode(text.encode('utf-8'))
                  except RuntimeError as e:
                      raise e
                  words = []
                  while node:
                      word = self.get_word(node.surface, node.feature)
                      if word:
                          words.append(word)
                      node = node.next
                  return words
  16. 1. 2. 3. 4. 5.
  17. Yahoo
      { "Yahoo": 1, " ": 2, " ": 1, .. }
      { "Yahoo": 1.5, " ": 5, " ": 1.1, .. } ※TF-IDF
  18. 1. 2. 3. 4. 5.
  19. nltk.NaiveBayesClassifier
      * classifier = nltk.NaiveBayesClassifier.train([( 1, "A"), ( 2, "B"), ( 3, "A"), ( 4, "C"), ..])
      * classifier.classify({"Yahoo": 1.5, " ": 5, " ": 1.1, ..}) → "B"
  20. http://d.hatena.ne.jp/aidiary/20100613/1276389337
  21. http://politter.com/list/ twitter
  22. http://politter.com/list/ twitter. train: 210, test: 95, classifier
      Hiro_Ishikawa, demezo, fukudachie, Hori2009, IsamuUeda, shinochan55, ikeda_mari, JunyaNISHIMURA, MariOkada, tadamori_oshima, NextJapanLDP, TAIRAMASAAKI, SatoYukari, konotarogomame, T_Ogawa_DPJ, tanigaki_s, ... ...
  23. $ py tweetclass.py classify train.txt hatoyamayukio
      train.txt .. 210
      hatoyamayukio ( )
      $ py tweetclass.py classify train.txt konotarogomame
      train.txt .. 210
      konotarogomame ( )
  24. $ py tweetclass.py test train.txt test.txt
      train.txt .. 210
      0.421485445764
      1/3 210 x 100 60%
  25.
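
Slide 17 shows a word-count dictionary next to a reweighted one marked TF-IDF, but the transcript does not show how the weights were computed. The following is a minimal sketch of one common variant, count × log(N / df), written for Python 3 with the standard library only (the slide code itself is Python 2). The function name `tfidf_features`, the formula, and the sample words are assumptions for illustration, not taken from the slides.

```python
import math
from collections import Counter


def tfidf_features(documents):
    """Turn a list of word lists into a list of {word: tf-idf weight} dicts."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for words in documents:
        df.update(set(words))

    features = []
    for words in documents:
        counts = Counter(words)  # raw term frequency within this document
        features.append({
            word: count * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return features


# Hypothetical word lists, e.g. one per Twitter account after tokenizing.
docs = [
    ["Yahoo", "news", "news"],
    ["Yahoo", "music"],
    ["Yahoo", "vote", "news"],
]
for feats in tfidf_features(docs):
    print(feats)
```

With this variant a word that occurs in every document (here "Yahoo") gets weight 0, so it contributes nothing to telling the documents apart. Smoothed variants avoid the hard zero.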

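
Slide 10 wraps get_timeline in a TokyoCabinet-backed cache because of the API rate limit. The same pattern can be sketched with the standard-library `shelve` module in place of TokyoCabinet. This is a substitution for illustration in Python 3, not the author's setup: `TimelineCache`, its `fetch` argument, and the default file name are hypothetical.

```python
import shelve


class TimelineCache(object):
    """Keep fetched timelines on disk so each user is requested at most once."""

    def __init__(self, fetch, datafile="tweets.db"):
        self.fetch = fetch                # callable: user id -> list of tweet texts
        self.db = shelve.open(datafile)   # persistent dict-like store (string keys)

    def close(self):
        self.db.close()

    def get_timeline(self, user_id, update=False):
        # Call the API only on a cache miss or when a refresh is requested.
        if user_id not in self.db or update:
            try:
                timeline = self.fetch(user_id)
            except Exception as e:        # e.g. a rate-limit or network error
                print(e)
                timeline = []             # remember the failure as an empty timeline
            self.db[user_id] = timeline
        return self.db[user_id]
```

Passing the fetch function in, rather than calling the Twitter client directly, makes it possible to try the cache with a stub function before spending any of the hourly API quota.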