I Built a Text Classifier with Hadoop/Mahout/HBase



                     2011/04/10
                   #TokyoWebmining10-2

                         yanaoki



2011   4   18
                • HBase
                • Mahout
                • Naive Bayes
                • Web

                •   naoki yanai
                •   Hadoop

HBase
                •   KeyValue
                    •   read/write
                        •   goal is the hosting of very large tables -- billions of rows, millions of columns ...
                    •   Hadoop
                •   CAP: C, P
                    •   C: Consistency, A: Availability, P: Partition tolerance
                •   Sharding
                •   Hadoop/MapReduce

HBase
                •   qualifier

Mahout
           •    Hadoop
                •   HBase
           •    Classifier / Clustering / Pattern Mining
           •    Recommenders / Collaborative Filtering
           •    Evolutionary Algorithms ...

Mahout
           •    Mahout
           •    Mahout in Action PDF
           •    hamadakoichi
           •    TokyoWebmining

Naive Bayes
           •        F1,...,Fn           C




           •    C
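
The formulas on this slide did not survive extraction. For reference, the textbook naive Bayes model for features F1,...,Fn and a class C is sketched below. It is the standard formulation, offered on the assumption that this is what the slide showed, not a transcription of it:

```latex
P(C \mid F_1,\dots,F_n) \;\propto\; P(C)\prod_{i=1}^{n} P(F_i \mid C),
\qquad
\hat{C} \;=\; \operatorname*{arg\,max}_{C}\; P(C)\prod_{i=1}^{n} P(F_i \mid C)
```

The product form follows from assuming the features are conditionally independent given the class.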




•       Web

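•    Ruby

                •   ExtractContent

           require "open-uri"
           require "kconv"            # provides String#toutf8
           require "extractcontent"

           # Fetch the page (the URL is elided on the original slide)
           html = URI.open("http://news.nifty.com/....htm").read
           # Extract the main body text and the title from the HTML
           body, title = ExtractContent::analyse(html)

           puts body.toutf8 #=> body text extracted from the HTML
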
•    Ruby

                •   scrAPI

       require 'scrapi'
       require 'open-uri'

       # Collect the text of every div.tweet element into the tweets array
       scr = Scraper.define do
         process "div.tweet", "tweets[]" => :text
         result :tweets
       end

       tweets = scr.scrape(URI.parse("http://togetter.com/li/121476"),
                           :parser_options => {:char_encoding => 'utf8'})

       tweets.each { |tw| puts tw } #=> prints the text of each tweet

•    RSS    HBase

       (URL)                                             content      categories
       http://togetter/1.html                            ...          category:src="togetter", category:cat="social"
       http://news.nifty.com/....html                    AKB ...      category:src="nifty", category:cat="entertainment"
       http://groups.google.com/group/webmining-tokyo/   10 …
       http://ameblo.jp/....html                         KARA …
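
The table above can be pictured as a map from row key (URL) to `family:qualifier` cells. The following is a minimal in-memory sketch, not HBase client code: `ROWS` and `labeled?` are hypothetical names, the elided content cells are left out, and reading "rows with a `category:cat` cell" as the labeled set is an assumption.

```ruby
# Hypothetical in-memory picture of the table on the slide:
# row key (URL) -> "family:qualifier" -> value.
ROWS = {
  "http://togetter/1.html" => {
    "category:src" => "togetter",
    "category:cat" => "social",
  },
  "http://groups.google.com/group/webmining-tokyo/" => {},
}

# Rows that already carry a category:cat cell count as labeled;
# the rest are the ones a classifier would have to label.
def labeled?(cells)
  cells.key?("category:cat")
end

labeled, unlabeled = ROWS.partition { |_url, cells| labeled?(cells) }
labeled_urls = labeled.map(&:first)
unlabeled_urls = unlabeled.map(&:first)
puts "labeled:   #{labeled_urls.inspect}"
puts "unlabeled: #{unlabeled_urls.inspect}"
```

Keeping the source and the category as separate qualifiers under one family lets either be added later without changing the row.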

•    HBase

                    category_id <TAB>

           •    HBase           MapReduce   HDFS
                •   Wikipedia
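
The `category_id <TAB>` line above suggests a tab-separated training file with the label in the first column. What followed `<TAB>` on the slide did not survive, so the sketch below assumes the second column holds the document's space-separated tokens. `training_line` is a hypothetical helper, not part of Mahout or HBase.

```ruby
# Build one training line: label, a TAB, then space-separated tokens.
# `training_line` is a hypothetical helper; the second column's format
# is an assumption, since the slide text after <TAB> was lost.
def training_line(category_id, tokens)
  # Tabs or newlines inside a token would corrupt the format, so collapse them.
  clean = tokens.map { |t| t.gsub(/\s+/, " ").strip }.reject(&:empty?)
  "#{category_id}\t#{clean.join(' ')}"
end

line = training_line("e", ["AKB", "new\tsingle", ""])
puts line
```

Collapsing whitespace inside tokens keeps exactly one TAB per line, so a reader can split each line on the first TAB.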
•    mahout

                    $ mahout trainclassifier       ...

                    $ mahout testclassifier        …

           •    mahout

                •    --input/--output: input / output paths

                •    --dataSource: HDFS or HBase

                •    --gramSize: N-gram size

                •    --classifierType: bayes or cbayes

                •    --alpha: smoothing parameter

                •    --minDF/--minSupport: minimum document frequency / minimum support

•                            HBase


           •
           =======================================================
           Summary
           -------------------------------------------------------
           Correctly Classified Instances          :       1884       82.2348%
           Incorrectly Classified Instances        :        407       17.7652%
           Total Classified Instances              :       2291
           =======================================================
           Confusion Matrix
           -------------------------------------------------------
           a       b       c       d       e       <--Classified as
           216     32      22      155     0        |  425         a     = t
           0       514     13      70      0        |  597         b     = s
           0       2       514     9       0        |  525         c     = e
           1       8       13      638     0        |  660         d     = b
           0       0       67      15      2        |  84          e     = a
           Default Category: unknown: 5
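
The summary figures can be recomputed from the confusion matrix itself. The sketch below copies the matrix above, treats rows as the true categories and columns as the predicted ones, and derives overall accuracy and per-category recall. `LABELS` and `MATRIX` are illustrative names.

```ruby
# Rows are the true categories, columns the predicted ones,
# copied from the confusion matrix above (labels t, s, e, b, a).
LABELS = %w[t s e b a]
MATRIX = [
  [216,  32,  22, 155, 0],
  [  0, 514,  13,  70, 0],
  [  0,   2, 514,   9, 0],
  [  1,   8,  13, 638, 0],
  [  0,   0,  67,  15, 2],
]

correct  = MATRIX.each_index.sum { |i| MATRIX[i][i] }  # diagonal cells
total    = MATRIX.sum { |row| row.sum }
accuracy = correct.to_f / total

# Per-category recall: diagonal cell divided by its row total.
recall = LABELS.each_with_index.to_h do |label, i|
  [label, MATRIX[i][i].to_f / MATRIX[i].sum]
end

puts format("accuracy: %d / %d = %.4f%%", correct, total, accuracy * 100)
recall.each { |label, r| puts format("recall %s: %.1f%%", label, r * 100) }
```

The overall figure hides large per-category differences: category a has only 2 of its 84 documents on the diagonal.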


           •    reducer    HBase

            // Set up the classifier; the model is read from the HBase table
            BayesParameters params = new BayesParameters();
            params.set("alpha_i", "1");
            CBayesAlgorithm algorithm = new CBayesAlgorithm();
            HBaseBayesDatastore datastore = new HBaseBayesDatastore("model_table_name", params);
            ClassifierContext classifier = new ClassifierContext(algorithm, datastore);

            // Classify one document (doc is a list of tokens); "default" is the fallback category
            ClassifierResult category = classifier.classifyDocument(
                doc.toArray(new String[doc.size()]), "default");

            String label = category.getLabel();


       (URL)                                             content      categories
       http://togetter/1.html                            ...          category:src="togetter", category:cat="social"
       http://news.nifty.com/....html                    AKB ...      category:src="nifty", category:cat="entertainment"
       http://groups.google.com/group/webmining-tokyo/   10 …         category:cat="technology"
       http://ameblo.jp/....html                         KARA …       category:cat="entertainment"

Web
                •   Google News    Togetter    RSS

                        a                   935        5.2M
                        b                  5,112       7.2M
                        e                  3,746       8.1M
                        s                  4,737       12M
                        t                  3,969       9.2M

4/18

                                 Web
                •
                      •
                =======================================================
                Summary
                -------------------------------------------------------
                Correctly Classified Instances          :      13388        91.6798%
                Incorrectly Classified Instances        :       1215         8.3202%
                Total Classified Instances              :      14603

                =======================================================
                Confusion Matrix
                -------------------------------------------------------
                a         b         c         d         e         <--Classified as
                2328      19        515       250       0          |  3112       a       =   t
                3         2939      54        20        0          |  3016       b       =   e
                32        3         3542      109       0          |  3686       c       =   s
                33        16        128       3877      0          |  4054       d       =   b
                1         27        2         3         702        |  735        e       =   a
                Default Category: unknown: 5


Web
                •   alpha

                              1         0.5      0.1      0.01     0.001
                            65.38%   65.83%   66.73%   66.82%   67.02%
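
To show what a smoothing parameter like alpha controls, here is a sketch of textbook additive (Lidstone) smoothing for a word likelihood. It is not Mahout's exact formula, which weights features differently. `smoothed_likelihood` and the counts are hypothetical.

```ruby
# Additive (Lidstone) smoothing of a word likelihood P(word | category).
# alpha = 1 is Laplace smoothing; smaller values trust the raw counts more.
# Textbook form only; not a transcription of Mahout's implementation.
def smoothed_likelihood(word_count, category_total, vocab_size, alpha)
  (word_count + alpha) / (category_total + alpha * vocab_size).to_f
end

# Hypothetical counts: a word never seen in a category of 1,000 tokens,
# with a vocabulary of 5,000 distinct words.
[1, 0.5, 0.1, 0.01, 0.001].each do |alpha|
  p_unseen = smoothed_likelihood(0, 1_000, 5_000, alpha)
  puts format("alpha=%-5s P(unseen word | category) = %.2e", alpha, p_unseen)
end
```

A smaller alpha gives unseen words less probability mass, so the counts observed in training dominate the score.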


4/18

Web
                •   N-Gram

                                     unigram   bigram
                                     63.57%    66.09%
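
The unigram and bigram features compared above can be sketched as follows. `ngrams` and `features` are hypothetical helpers. Whether Mahout's `--gramSize` also keeps the shorter n-grams is an assumption here, not something the slide states.

```ruby
# Word n-grams of a token list; n = 1 gives unigrams, n = 2 bigrams.
def ngrams(tokens, n)
  tokens.each_cons(n).map { |gram| gram.join(" ") }
end

# Features for a given gram size: every n-gram from 1 up to gram_size.
# Keeping the shorter n-grams as well is an assumption of this sketch.
def features(tokens, gram_size)
  (1..gram_size).flat_map { |n| ngrams(tokens, n) }
end

tokens = %w[hadoop mahout hbase]
p ngrams(tokens, 2)    # → ["hadoop mahout", "mahout hbase"]
p features(tokens, 2)
```

Bigrams keep some word-order information that single words lose, at the cost of a much larger feature space.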


Web
                                            +
                                  56.8%   65.38%

4/18

Web
                                  67.02%   67.88%

                •   HBase/Mahout
                    •   HBase