Introduction to Document Classification, 2014
Hiroyuki Tokunaga
2014/06/19 PFI Seminar
About me: Hiroyuki Tokunaga
● I mainly work on natural language processing
● http://tkng.org
● twitter: @tkng
● I like curry
● I lost about 7 kg in one year
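Today's topic: document classification
Assigning each document to one of two or more classes
● Spam detection
o Binary classification: spam or not spam
● Sentiment classification
o For example, five-way classification: excellent / good / average / bad / terrible
It comes up at work surprisingly often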
Document classification and me
http://d.hatena.ne.jp/tkng/20081217/1229475900
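Problem definition
Learn a function f(x) that outputs y = 1 or -1 for an input x
● In particular, when f(x) = [w・x], f(x) is called a linear classifier (here [] returns 1 or -1 according to the sign of its argument)
● Most of today's talk is about linear classification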
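The definition f(x) = [w・x] can be sketched in a few lines of Python. This is only an illustration: the function name `predict`, the example weights, and the choice to map a score of exactly 0 to +1 are assumptions, not something the slides specify.

```python
def predict(w, x):
    """Linear classifier: return +1 or -1 according to the sign of w . x."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    # Assumption: a score of exactly 0 is treated as +1.
    return 1 if score >= 0 else -1


# Hypothetical weights and input, chosen only for illustration.
w = [0.5, -1.0, 0.25]
x = [1, 0, 1]
label = predict(w, x)
```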
Bag of words: a refresher
x is a vector whose element x_k is 1 if word k appears in the document and 0 otherwise
In a linear classifier, w learns a weight for each dimension of x
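A minimal sketch of the binary bag-of-words vector described above, assuming the document is already tokenized and the vocabulary is fixed in advance. The function name and the toy vocabulary are hypothetical.

```python
def bag_of_words(tokens, vocabulary):
    """Return a 0/1 vector: element k is 1 if vocabulary[k] occurs in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Toy vocabulary and document, for illustration only.
vocabulary = ["free", "meeting", "prize", "report"]
x = bag_of_words(["free", "prize", "free"], vocabulary)
```

Note that repeated words still map to 1, because this representation records presence and not counts.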
Assorted recent research
● NBSVM (ACL2012) 
● Volume Regularization (NIPS2012) 
● Sentence Regularization (ICML2014) 
● Paragraph Vector (ICML2014)
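Generative and discriminative models
Generative models: Naive Bayes, etc.
Discriminative models: logistic regression, SVM, etc.
Discriminative models are generally said to perform better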
Generative model (Naive Bayes): overview
P(y|x) ∝ P(x|y)P(y)
Both P(y) and P(x|y) can be estimated easily by counting
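To make "estimated by counting" concrete, here is a rough sketch of a multinomial Naive Bayes classifier. The slides do not say which variant or smoothing is used, so the multinomial model, the add-alpha smoothing, and all names below are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Collect the counts needed for P(y) and P(word | y)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha


def log_posterior(tokens, y, model):
    """Unnormalized log P(y | x) = log P(y) + sum of log P(word | y)."""
    class_counts, word_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    total = sum(word_counts[y].values())
    score = math.log(class_counts[y] / n_docs)
    for t in tokens:
        if t in vocab:  # words never seen in training are ignored
            p = (word_counts[y][t] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
    return score


def predict_nb(tokens, model):
    """Pick the class with the highest unnormalized log posterior."""
    return max(model[0], key=lambda y: log_posterior(tokens, y, model))


# Toy training data, for illustration only.
docs = [["free", "prize"], ["free", "money"],
        ["meeting", "report"], ["report", "today"]]
labels = [1, 1, -1, -1]
model = train_nb(docs, labels)
```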
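Discriminative model (SVM): overview
min. L(w) = Σ max(1 - y w・x, 0) + ||w||^2
Minimize the sum of a loss term and a regularization term
● Loss term: make fewer mistakes on the training data
● Regularization term: prevent an overly complex model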
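The objective above can be evaluated, and reduced by a plain subgradient step, as in the sketch below. The regularization weight `lam`, the learning rate `lr`, and the use of batch subgradient descent are assumptions for illustration. Practical SVM solvers use more refined optimizers.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def objective(w, data, lam=1.0):
    """L(w) = sum of hinge losses + lam * ||w||^2 (lam=1 matches the slide)."""
    hinge = sum(max(1.0 - y * dot(w, x), 0.0) for x, y in data)
    return hinge + lam * dot(w, w)


def subgradient_step(w, data, lam=1.0, lr=0.1):
    """One batch subgradient descent step on L(w)."""
    grad = [2.0 * lam * wi for wi in w]  # gradient of the regularizer
    for x, y in data:
        if 1.0 - y * dot(w, x) > 0.0:  # margin violated: hinge term is active
            grad = [g - y * xi for g, xi in zip(grad, x)]
    return [wi - lr * g for wi, g in zip(w, grad)]


# Toy data: (feature vector, label) pairs, for illustration only.
data = [([1, 0], 1), ([0, 1], -1)]
w0 = [0.0, 0.0]
w1 = subgradient_step(w0, data)
```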
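Are discriminative models really better?
● When you investigate the causes of misclassifications, the weights a discriminative model has learned are often hard to accept
o A word that appears 20 times in one class and only 2 times in the other gets a weight of only about 0.1
o Meanwhile, a word that appears about 1000 times in one class and about 2000 times in the other ends up with a weight of about 1.3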
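What is going on?
● With regularization, the weights of low-frequency words inevitably tend to be small
● But weakening the regularization usually makes performance worse
As a rule of thumb, rare words are often important, yet regularization wipes out their weights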
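One way to see why rare words end up with small weights is to look at the optimality condition of the SVM objective. The derivation below is a sketch under stated assumptions (binary bag-of-words features and an explicit regularization weight λ, which is not in the slides). Here df_k is the number of training documents that contain word k.

```latex
% Assumptions: x_{ik} \in \{0, 1\}, and
% L(w) = \sum_i \max(1 - y_i\, w \cdot x_i,\, 0) + \lambda \lVert w \rVert^2 .
% (\lambda = 1 recovers the formula on the SVM overview slide.)
\begin{align*}
0 &\in \partial_{w_k} L(w)
  \;\Longrightarrow\;
  2 \lambda w_k = \sum_i \alpha_i\, y_i\, x_{ik},
  \qquad \alpha_i \in [0, 1], \\
|w_k| &\le \frac{1}{2\lambda} \sum_i x_{ik} = \frac{\mathrm{df}_k}{2\lambda}.
\end{align*}
```

Under these assumptions the magnitude of a weight is capped in proportion to how many documents contain the word, so a word seen in a handful of documents cannot receive a large weight however cleanly it separates the classes.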
tf idf
tf idf: a measure of how important a word is
tf: term frequency
df: document frequency
idf: inverse document frequency
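The slides do not give a formula, so the sketch below uses one common definition, idf = log(N / df) with raw counts as tf. Other variants (smoothing, normalization) exist, and the function names here are hypothetical.

```python
import math
from collections import Counter


def idf(docs):
    """idf(t) = log(N / df(t)), where df(t) = number of documents containing t."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word at most once per document
    return {t: math.log(n / c) for t, c in df.items()}


def tf_idf(tokens, idf_table):
    """tf-idf for one document, using raw term counts as tf."""
    tf = Counter(tokens)
    return {t: tf[t] * idf_table[t] for t in tf if t in idf_table}


# Toy corpus, for illustration only.
corpus = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
idf_table = idf(corpus)
weights = tf_idf(["b", "b", "c"], idf_table)
```

A word that appears in every document ("a" here) gets idf 0, while the rarest word ("c") gets the largest idf.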
Document classification that takes idf into account
Tackling the Poor Assumptions of Naive Bayes Text Classifiers (Rennie et al., ICML2003)
● The Complement Naive Bayes paper mentioned at the start of this talk
Looking back at Complement Naive Bayes
Term Weighted CNB performs on par with SVM…
       ____ 
     /⌒  ⌒ \ 
   /( ●)  (●) \ 
  /::::::⌒(__人__)⌒::::: \   Then let's put idf into the SVM!
  |     |r┬-|     | 
  \      `ー'´      / 
Note: strictly speaking, it looks at the ratio of occurrence rates rather than idf, so the motivation is different
Naive Bayes SVM (Wang, ACL2012)
Scale the weights of the input words by the class ratio beforehand, then feed them to an SVM
Wait, is such a simple method really enough…?
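The pre-scaling step can be sketched as follows, using the log-count ratio as the "class ratio". The smoothing value `alpha` and the function names are assumptions, and only the feature scaling is shown. Training the SVM on the scaled vectors, and any further details of the published method, are left out.

```python
import math


def log_count_ratio(X, y, alpha=1.0):
    """r_k = log((p_k / ||p||_1) / (q_k / ||q||_1)).

    p and q are the smoothed per-feature counts over the positive (y = 1)
    and negative (y = -1) training documents respectively.
    """
    n_features = len(X[0])
    p = [alpha] * n_features
    q = [alpha] * n_features
    for x, label in zip(X, y):
        target = p if label == 1 else q
        for k, v in enumerate(x):
            target[k] += v
    p_sum, q_sum = sum(p), sum(q)
    return [math.log((pk / p_sum) / (qk / q_sum)) for pk, qk in zip(p, q)]


def scale(x, r):
    """Element-wise product of a feature vector with the ratio vector."""
    return [xk * rk for xk, rk in zip(x, r)]


# Toy binary bag-of-words data, for illustration only.
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = [1, 1, -1, -1]
r = log_count_ratio(X, y)
X_scaled = [scale(x, r) for x in X]  # these vectors would be given to the SVM
```

A feature that occurs equally often in both classes gets a ratio of 0 and is effectively switched off, while class-specific features are amplified with the sign of their class.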
Experimental results
Performance improves by 2 to 3% over a plain SVM
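Thoughts
● A little preprocessing before feeding the data to an SVM improves performance, which is a nice win
● No black magic required, so it makes a good baseline
● It is also a plus that it is easy to see what it is doing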
Aside: the types in the formulas feel off
Apparently MATLAB lets you add a real number to a vector
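Outlook and expectations
● The current state of the art in document classification is paragraph vector, but it feels like black magic
● There may still be room to develop new methods that exploit the characteristics of document classification (high-dimensional, sparse, and some low-frequency features are clearly important)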
Assorted recent research (repeated)
● NBSVM (ACL2012) 
● Volume Regularization (NIPS2012) 
● Sentence Regularization (ICML2014) 
● Paragraph Vector (ICML2014)
