I built a text classifier with Hadoop/Mahout/HBase

  1. Hadoop/Mahout/HBase • 2011/04/10 • #TokyoWebmining10-2 • yanaoki • 2011/04/18
  2. • HBase • Mahout • Naive Bayes • Web
  3. • naoki yanai • … • Hadoop
  4. HBase • KeyValue • read/write • "goal is the hosting of very large tables -- billions of rows, millions of columns ..." • Hadoop • CAP: C, P • C: consistency, A: availability, P: partition tolerance • Sharding • Hadoop/MapReduce
  5. HBase • qualifier
  6. Mahout • Hadoop • HBase • Classifier / Clustering / Pattern Mining • Recommenders / Collaborative Filtering • Evolutionary Algorithms ...
  7. Mahout • Mahout • Mahout in Action PDF • hamadakoichi • TokyoWebmining
  8. Naive Bayes • features F1,...,Fn and class C • C
  9. Naive Bayes
  10. Naive Bayes
  11. • Web
  13. • Ruby • ExtractContent
          require "open-uri"
          require "extractcontent"
          # URI.open: Kernel#open no longer accepts URLs as of Ruby 3.0
          html = URI.open("http://news.nifty.com/....htm").read
          body, title = ExtractContent::analyse(html)
          puts body.toutf8 #=> HTML
  14. • Ruby • scrAPI
          require 'scrapi'
          require 'open-uri'
          scr = Scraper.define do
            process "div.tweet", "tweets[]" => :text
            result :tweets
          end
          tweets = scr.scrape(URI.parse("http://togetter.com/li/121476"),
                              :parser_options => {:char_encoding => 'utf8'})
          tweets.each { |tw| puts tw } #=>
  15. • RSS HBase
          (URL) | content | categories
          http://togetter/1.html | ... | category:src="togetter", category:cat="social"
          http://news.nifty.com/....html | AKB ... | category:src="nifty", category:cat="entertainment"
          http://groups.google.com/group/webmining-tokyo/ | 10 … |
          http://ameblo.jp/....html | KARA … |
  16. • HBase category_id <TAB> • HBase MapReduce HDFS • Wikipedia
  17. • mahout
          $ mahout trainclassifier ...
          $ mahout testclassifier …
      • mahout
        • --input/--output
        • --dataSource: HDFS or HBase
        • --gramSize: N-gram
        • --classifierType
        • --alpha
        • --minDF/--minSupport
  18. • HBase
          =======================================================
          Summary
          -------------------------------------------------------
          Correctly Classified Instances          :       1884       82.2348%
          Incorrectly Classified Instances        :        407       17.7652%
          Total Classified Instances              :       2291
          =======================================================
          Confusion Matrix
          -------------------------------------------------------
          a       b       c       d       e       <--Classified as
          216     32      22      155     0        |  425         a     = t
          0       514     13      70      0        |  597         b     = s
          0       2       514     9       0        |  525         c     = e
          1       8       13      638     0        |  660         d     = b
          0       0       67      15      2        |  84          e     = a
          Default Category: unknown: 5
  19. • reducer HBase
          // Set up the classifier; the model is read from an HBase table
          BayesParameters params = new BayesParameters();
          params.set("alpha_i", "1");
          algorithm = new CBayesAlgorithm();
          datastore = new HBaseBayesDatastore("model_table_name", params);
          classifier = new ClassifierContext(algorithm, datastore);
          // Classify a tokenized document; "default" is the fallback category
          ClassifierResult category = classifier.classifyDocument(doc.toArray(new String[doc.size()]), "default");
          String label = category.getLabel();
  20. • (URL) | content | categories
          http://togetter/1.html | ... | category:src="togetter", category:cat="social"
          http://news.nifty.com/....html | AKB ... | category:src="nifty", category:cat="entertainment"
          http://groups.google.com/group/webmining-tokyo/ | 10 … | category:cat="technology"
          http://ameblo.jp/....html | KARA … | category:cat="entertainment"
  21. Web
  22. Web • Google News, Togetter RSS • … • …
          a     935   5.2M
          b   5,112   7.2M
          e   3,746   8.1M
          s   4,737   12M
          t   3,969   9.2M
  23. 4/18 Web
          =======================================================
          Summary
          -------------------------------------------------------
          Correctly Classified Instances          :      13388        91.6798%
          Incorrectly Classified Instances        :       1215         8.3202%
          Total Classified Instances              :      14603
          =======================================================
          Confusion Matrix
          -------------------------------------------------------
          a         b         c         d         e         <--Classified as
          2328      19        515       250       0          |  3112       a     = t
          3         2939      54        20        0          |  3016       b     = e
          32        3         3542      109       0          |  3686       c     = s
          33        16        128       3877      0          |  4054       d     = b
          1         27        2         3         702        |  735        e     = a
          Default Category: unknown: 5
  24. Web
          alpha | 1      | 0.5    | 0.1    | 0.01   | 0.001
                | 65.38% | 65.83% | 66.73% | 66.82% | 67.02%
  25. 4/18 Web
          N-Gram | unigram | bigram
                 | 63.57%  | 66.09%
  26. Web • + 56.8% 65.38%
  27. 4/18 Web • 67.02% 67.88%
  28. • HBase/Mahout • HBase
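
The summary figures printed by `mahout testclassifier` on slide 18 can be re-derived from its confusion matrix. The Ruby sketch below is an illustration and not part of the original deck. It assumes that each row holds the documents of one actual class and each column a predicted class (the usual reading of `<--Classified as`); the helper names `accuracy` and `recall_by_class` are hypothetical.

```ruby
# Confusion matrix copied from slide 18.
# Assumption: rows are actual classes, columns are predicted classes,
# both in the order given by the slide's legend (a = t, b = s, c = e, d = b, e = a).
LABELS = %w[t s e b a].freeze
MATRIX = [
  [216,  32,  22, 155, 0],
  [  0, 514,  13,  70, 0],
  [  0,   2, 514,   9, 0],
  [  1,   8,  13, 638, 0],
  [  0,   0,  67,  15, 2]
].freeze

# Overall accuracy: sum of the diagonal divided by the sum of all cells.
def accuracy(matrix)
  correct = matrix.each_with_index.sum { |row, i| row[i] }
  total = matrix.sum { |row| row.sum }
  correct.to_f / total
end

# Per-class recall: diagonal cell divided by the row total of that class.
def recall_by_class(matrix, labels)
  labels.each_with_index.map do |label, i|
    [label, matrix[i][i].to_f / matrix[i].sum]
  end.to_h
end

puts format("accuracy: %.4f%%", accuracy(MATRIX) * 100)
recall_by_class(MATRIX, LABELS).each do |label, recall|
  puts format("recall %s: %.1f%%", label, recall * 100)
end
```

For the slide 18 matrix this yields 1884 correct out of 2291, which matches the reported 82.2348%. The per-class view shows where the errors sit: only 2 of the 84 documents in class a are classified correctly, and 155 of the 425 documents in class t are assigned to class b.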
