A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito
Introduction: Text Categorization
Introduction: Text Categorization
Introduction: Machine Learning
Introduction: flow of ML (figure labels: Label1, Label2, ?)
Outline
Number of labels (figure labels: Yes, No; L1, L2, L3, L4)
Types of labels
Outline
Feature of Text
Preprocessing
Term Weighting
Sentiment Weighting (figure: d(good, happy) = 2, d(bad, happy) = 4; nodes: good, bad, happy)
Dimension Reduction
Dimension Reduction
Outline
Learning Algorithm
Naïve Bayes
k-Nearest Neighbor (figure labels: d1, d2, θ, k=3)
Boosting
Simple example of Boosting + + + + + - - - - - + + + + + - - - - - 1. - - + + + + + - - - 2. + + + + + - - - - - 3.
Support Vector Machine
Text Categorization with SVM
Comparison of these methods (table residue: .920 .870 SVM Boosting Naïve Bayes k-NN Method .878 .795 .860 Ver.1(90) - .815 .823 Ver.2(10))
Hierarchical Learning
TreeBoost root L1 L2 L3 L4 L11 L12 L41 L42 L43 L421 L422
Outline
Conclusion

Editor's Notes

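  1. As the Internet spreads and more documents are created electronically, large volumes of digital data such as e-mail, news, and blogs have become available. As a result, from the standpoint of time and labor cost, there is a growing need to classify large numbers of documents efficiently without manual effort.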
  2. Example applications include automatically determining which topic a text belongs to and extracting opinions and reputation from the Web.
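  3. The most widely used approach to automatic text classification is machine learning based on textual information such as words. Machine learning divides broadly into supervised and unsupervised learning; this seminar covers supervised learning.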
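  4. This slide shows the main flow of machine learning for text classification. First, text written in natural language is converted into a form a machine can handle (feature extraction). Next, a learner is trained on those features (learning). When unseen data arrives, it is classified with the trained learner (classification). Because text classification follows almost the same flow as machine learning in general, it is widely studied in the machine learning field. This survey examines the methods used at each of these stages.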
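  5. This part explains feature extraction from text data. First, data written in natural language must be converted into some numerical form, using morphological analysis or a similar technique.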
  6. At this stage, words that occur extremely often, such as "the" and "for" in English, need to be removed as stop words.
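  7. The simplest method that comes to mind is to count the occurrences of each word. Consider a (number of documents) × (number of words) matrix that records how many times each word appears in each document. This makes the data very easy to handle, but because it looks only at raw counts, accuracy is not very high.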
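  8. The tf-idf method addresses this. It multiplies (the frequency with which a word appears in a document) by (the inverse of the number of documents in which the word appears), so that words that are frequent in a document but rare across the collection receive a high weight. It is widely used for feature extraction in text classification. Using tf-idf, or a normalized version of it, as the document representation has become the de facto standard, and little new research is being done on this step.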
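  9. As it stands, the representation has (number of documents) × (number of words in the dictionary) entries, which is very large. Feature selection is used to reduce this dimensionality.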
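  10. One method used here is to set a threshold on occurrence frequency: words that do not appear in at least a certain number of documents are not used for learning. This rests on the assumption that words appearing in very few documents are unlikely to help classification.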
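
The three stages described in note 4 (feature extraction, learning, classification) can be sketched as follows. This is a minimal illustration, not the method of the slides: it assumes a bag-of-words representation and a multinomial Naïve Bayes learner with add-one smoothing, and the function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extraction: turn raw text into a bag of lowercase words."""
    return Counter(text.lower().split())

def train(labeled_texts):
    """Learning: collect per-label word counts and label frequencies."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        word_counts[label].update(extract_features(text))
        label_counts[label] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Classification: return the label with the highest log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    features = extract_features(text)
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word, count in features.items():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (hypothetical data for illustration only).
training_data = [
    ("stock prices rise on strong earnings", "finance"),
    ("market falls as stock investors sell", "finance"),
    ("team wins the final match", "sports"),
    ("player scores in the match", "sports"),
]
model = train(training_data)
predicted = classify("stock market earnings", model)
```

Any other supervised learner (k-NN, boosting, SVM) could replace the `train` and `classify` pair without changing the feature-extraction step, which is why the flow is shared with machine learning in general.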
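
Notes 6 to 8 (stop-word removal, raw term counts, and tf-idf weighting) can be sketched in a few lines. The stop-word list and corpus below are hypothetical. Note 8 states the idf factor as a plain reciprocal of the document frequency; this sketch assumes the common logarithmic variant, log(N / df), instead.

```python
import math
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "for", "a", "of", "is", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def term_counts(docs):
    """Raw occurrence counts: one Counter of terms per document."""
    return [Counter(tokenize(d)) for d in docs]

def tf_idf(docs):
    """Weight each term by tf * log(N / df) (assumed idf variant)."""
    counts = term_counts(docs)
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts once per term
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]

# Toy corpus (hypothetical data for illustration only).
docs = [
    "the price of the stock is rising",
    "the team won the match",
    "the stock market and the match for the title",
]
weights = tf_idf(docs)
```

In the first document, "price" occurs in only one of the three documents and so receives a higher weight than "stock", which occurs in two. This is the behavior note 8 describes: terms frequent in a document but rare in the collection are weighted up.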
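
The document-frequency threshold from note 10 can be sketched as below. The parameter name `min_df`, its value, and the toy documents are assumptions chosen for illustration; the notes do not give a specific threshold.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set() so repeats within a document count once
    return df

def select_features(tokenized_docs, min_df=2):
    """Keep terms found in at least `min_df` documents and build count vectors."""
    df = document_frequency(tokenized_docs)
    vocabulary = sorted(t for t, n in df.items() if n >= min_df)
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for tokens in tokenized_docs:
        row = [0] * len(vocabulary)
        for t in tokens:
            if t in index:
                row[index[t]] += 1
        vectors.append(row)
    return vocabulary, vectors

# Toy tokenized corpus (hypothetical data for illustration only).
tokenized_docs = [
    ["stock", "price", "rising"],
    ["team", "won", "match"],
    ["stock", "market", "match", "title"],
]
vocabulary, vectors = select_features(tokenized_docs, min_df=2)
```

With `min_df=2`, the eight distinct terms in the toy corpus shrink to the two that occur in more than one document, which illustrates how the threshold reduces the dimensionality mentioned in note 9.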