I built a text classifier with Hadoop/Mahout/HBase

Published in: Technology, Design

  1. Hadoop/Mahout/HBase. 2011/04/10, #TokyoWebmining10-2, yanaoki
  2. • HBase • Mahout • Naive Bayes • Web
  3. • naoki yanai • … • Hadoop
  4. HBase • KeyValue • read/write • "goal is the hosting of very large tables -- billions of rows, millions of columns ..." • Hadoop • CAP theorem: C and P • C: consistency, A: availability, P: partition tolerance • Sharding • Hadoop/MapReduce
  5. HBase • qualifier
  6. Mahout • Hadoop • HBase • Classifier / Clustering / Pattern Mining • Recommenders / Collaborative Filtering • Evolutionary Algorithms ...
  7. Mahout • Mahout • Mahout in Action PDF • hamadakoichi • TokyoWebmining
  8. Naive Bayes • feature variables F1,...,Fn, class variable C • C
  9. Naive Bayes
  10. Naive Bayes
  11. • Web
  13. • Ruby • ExtractContent
        require "open-uri"
        require "extractcontent"
        html = open("http://news.nifty.com/....htm").read
        body, title = ExtractContent::analyse(html)
        puts body.toutf8 #=> HTML
  14. • Ruby • scrAPI
        require "scrapi"
        require "open-uri"
        scr = Scraper.define do
          process "div.tweet", "tweets[]" => :text
          result :tweets
        end
        tweets = scr.scrape(URI.parse("http://togetter.com/li/121476"),
                            :parser_options => {:char_encoding => "utf8"})
        tweets.each { |tw| puts tw }
  15. • RSS HBase
        (URL)                                             content   categories
        http://togetter/1.html                            ...       category:src="togetter", category:cat="social"
        http://news.nifty.com/....html                    AKB ...   category:src="nifty", category:cat="entertainment"
        http://groups.google.com/group/webmining-tokyo/   10 …
        http://ameblo.jp/....html                         KARA …
  16. • HBase category_id <TAB> • HBase MapReduce HDFS • Wikipedia
  17. • mahout
        $ mahout trainclassifier ...
        $ mahout testclassifier …
      • mahout • --input/--output • --dataSource HDFS/HBase • --gramSize N-gram • --classifierType • --alpha • --minDF/--minSupport
  18. • HBase
        =======================================================
        Summary
        -------------------------------------------------------
        Correctly Classified Instances          :       1884       82.2348%
        Incorrectly Classified Instances        :        407       17.7652%
        Total Classified Instances              :       2291
        =======================================================
        Confusion Matrix
        -------------------------------------------------------
        a       b       c       d       e       <--Classified as
        216     32      22      155     0        |  425         a     = t
        0       514     13      70      0        |  597         b     = s
        0       2       514     9       0        |  525         c     = e
        1       8       13      638     0        |  660         d     = b
        0       0       67      15      2        |  84          e     = a
        Default Category: unknown: 5
  19. • reducer HBase
        // set up the classifier against the model stored in HBase
        BayesParameters params = new BayesParameters();
        params.set("alpha_i", "1");
        algorithm = new CBayesAlgorithm();
        datastore = new HBaseBayesDatastore("model_table_name", params);
        classifier = new ClassifierContext(algorithm, datastore);
        // classify one tokenized document; "default" is the fallback category
        ClassifierResult category = classifier.classifyDocument(doc.toArray(new String[doc.size()]), "default");
        String label = category.getLabel();
  20. •
        (URL)                                             content   categories
        http://togetter/1.html                            ...       category:src="togetter", category:cat="social"
        http://news.nifty.com/....html                    AKB ...   category:src="nifty", category:cat="entertainment"
        http://groups.google.com/group/webmining-tokyo/   10 …      category:cat="technology"
        http://ameblo.jp/....html                         KARA …    category:cat="entertainment"
  21. Web
  22. Web • Google News Togetter RSS • … • …
        a     935   5.2M
        b   5,112   7.2M
        e   3,746   8.1M
        s   4,737   12M
        t   3,969   9.2M
  23. Web • + • 56.8% 65.38%
  24. Web
        1        0.5      0.1      0.01     0.001
        65.38%   65.83%   66.73%   66.82%   67.02%
  25. • HBase/Mahout • HBase
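
Slide 8 names the features F1,...,Fn and the class C, but the formula itself did not survive in the transcript. For reference, the textbook naive Bayes decision rule in that notation is given below. This is the standard form, not text recovered from the slide, and it assumes the features are conditionally independent given the class.

```latex
\hat{c} \;=\; \operatorname*{arg\,max}_{c} \; P(C = c) \prod_{i=1}^{n} P(F_i \mid C = c)
```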
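
Slide 16 mentions training lines of the form `category_id <TAB> ...`; the part after the tab is missing from the transcript. A minimal sketch, assuming the rest of the line is the document's tokens separated by single spaces. The helper name `training_line` and the sample tokens are hypothetical.

```ruby
# Hypothetical helper: build one training line for a document.
# Assumption: the text after the tab is a space-separated token list.
def training_line(category_id, tokens)
  # Remove whitespace inside tokens and drop empty ones so the
  # single tab and single spaces stay the only separators.
  cleaned = tokens.map { |t| t.gsub(/\s+/, "") }.reject(&:empty?)
  "#{category_id}\t#{cleaned.join(' ')}"
end

puts training_line("t", %w[hadoop mahout hbase])
```

Keeping the separator handling in one place makes it easier to dump rows from a table into a flat file without corrupting the label column.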
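
Slide 17 lists `--gramSize` as the N-gram option. A minimal sketch of what word N-gram features look like, assuming all sizes from 1 up to N are emitted (the slide does not say whether shorter grams are kept). The helper name `ngrams` is hypothetical.

```ruby
# Hypothetical helper: word n-grams of every size from 1 to max_n.
def ngrams(tokens, max_n)
  (1..max_n).flat_map do |n|
    # each_cons(n) slides a window of n consecutive tokens over the list.
    tokens.each_cons(n).map { |gram| gram.join(" ") }
  end
end

tokens = %w[hadoop mahout hbase]
puts ngrams(tokens, 2).inspect
```

Larger N captures short phrases at the cost of a much larger feature vocabulary, which is why it is usually tuned together with minimum-frequency cutoffs such as `--minDF`/`--minSupport`.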
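
The summary figures on slide 18 can be recomputed from the confusion matrix printed there. A minimal sketch, assuming the usual reading of that output: rows are the true labels and columns are the predicted labels. The helper names `accuracy` and `recall_by_label` are hypothetical.

```ruby
# Confusion matrix copied from slide 18.
# Assumption: rows = true label, columns = predicted label.
LABELS = %w[t s e b a]
MATRIX = [
  [216,  32,  22, 155, 0],
  [  0, 514,  13,  70, 0],
  [  0,   2, 514,   9, 0],
  [  1,   8,  13, 638, 0],
  [  0,   0,  67,  15, 2],
]

# Overall accuracy: diagonal sum divided by the total count.
def accuracy(matrix)
  correct = matrix.each_index.sum { |i| matrix[i][i] }
  total = matrix.sum { |row| row.sum }
  correct.to_f / total
end

# Per-label recall: diagonal entry divided by its row total.
def recall_by_label(labels, matrix)
  labels.each_with_index.to_h do |label, i|
    [label, matrix[i][i].to_f / matrix[i].sum]
  end
end

puts format("accuracy: %.4f%%", accuracy(MATRIX) * 100)
recall_by_label(LABELS, MATRIX).each do |label, recall|
  puts format("recall[%s]: %.3f", label, recall)
end
```

The diagonal sums to 1884 out of 2291, matching the summary. The per-label view shows what the overall number hides: label `a` has only 2 of its 84 documents classified correctly.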
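
Slide 24 lists the values 1, 0.5, 0.1, 0.01, 0.001 against accuracies; the row labels are lost. Given `--alpha` on slide 17 and `alpha_i` on slide 19, they are plausibly the smoothing parameter. A minimal sketch of plain additive smoothing under that assumption; Mahout's complementary naive Bayes uses a more involved weighting, so this only illustrates the role of alpha. The helper name `smoothed_prob` and the counts are hypothetical.

```ruby
# Additive (Lidstone) smoothing for a word likelihood P(word | class).
# count:       occurrences of the word in the class
# class_total: total word occurrences in the class
# vocab_size:  number of distinct words overall
def smoothed_prob(count, class_total, vocab_size, alpha)
  (count + alpha).to_f / (class_total + alpha * vocab_size)
end

# Probability mass given to a word never seen in the class,
# for the alpha values listed on slide 24 (counts are made up).
[1, 0.5, 0.1, 0.01, 0.001].each do |alpha|
  unseen = smoothed_prob(0, 1000, 5000, alpha)
  puts format("alpha=%-6s P(unseen word | class)=%.8f", alpha, unseen)
end
```

A smaller alpha takes less probability mass away from observed words, which is one possible reason the accuracy on slide 24 rises slightly as the value shrinks.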
