Introduction to document classification 2014
Hiroyuki TOKUNAGA
Slides from the PFI seminar held on 2014/06/19.
1.
Introduction to Document Classification, 2014 Edition
Hiroyuki Tokunaga
2014/06/19 PFI Seminar
2.
About me: Hiroyuki Tokunaga
● I mainly work on natural language processing
● http://tkng.org
● twitter: @tkng
● I like curry
● I lost about 7 kg in one year
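3.
Today's topic: document classification
Classify documents into two or more classes
● Spam detection
o Binary classification: spam or not spam
● Sentiment classification
o For example, five-class classification: excellent / good / average / bad / terrible
One way or another, it comes up often at work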
4.
Document classification and me
http://d.hatena.ne.jp/tkng/20081217/1229475900
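5.
Problem definition
Given an input x, learn a function f(x) that outputs y = 1 or -1
● In particular, when f(x) = [w · x], f(x) is called a linear classifier (where [] returns 1 or -1 according to the sign of its contents)
● Today's talk is mostly about linear classification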
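The definition f(x) = [w · x] can be sketched in a few lines of Python. This is only an illustration: the weight values are made up, and mapping a score of exactly 0 to +1 is an assumption, since the slide does not say what [] returns at zero.

```python
from typing import Sequence


def sign(value: float) -> int:
    """The [] operator from the slide: 1 or -1 depending on the sign."""
    # Assumption: a score of exactly 0 is mapped to +1.
    return 1 if value >= 0 else -1


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Inner product w . x of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    """Linear classifier f(x) = [w . x]."""
    return sign(dot(w, x))


# Hypothetical weights and input, chosen only for illustration.
w = [0.5, -1.0, 0.25]
x = [1, 1, 0]
print(predict(w, x))  # w . x = -0.5, so the prediction is -1
```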
6.
Review of bag of words
x is a vector whose component x_k is 1 if word k appears in the document and 0 otherwise
In a linear classifier, w therefore learns one weight for each dimension of x
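A minimal sketch of this binary bag-of-words encoding follows. Splitting on whitespace is an assumption made to keep the example short; real text (Japanese in particular) would need a proper tokenizer. The function names and sample documents are hypothetical.

```python
def build_vocabulary(documents):
    """Assign each distinct word an index k, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.split():  # assumption: whitespace tokenization
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def to_binary_vector(doc, vocab):
    """x_k = 1 if word k occurs in the document, 0 otherwise."""
    x = [0] * len(vocab)
    for word in doc.split():
        if word in vocab:  # words outside the vocabulary are ignored
            x[vocab[word]] = 1
    return x


docs = ["cheap pills cheap", "meeting at noon"]
vocab = build_vocabulary(docs)
print(to_binary_vector("cheap meeting", vocab))  # [1, 0, 1, 0, 0]
```

Note that repeated words ("cheap" twice in the first document) still yield a 1, because this encoding records presence, not counts.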
7.
Various recent research
● NBSVM (ACL2012)
● Volume Regularization (NIPS2012)
● Sentence Regularization (ICML2014)
● Paragraph Vector (ICML2014)
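8.
Generative and discriminative models
Generative models: Naive Bayes, etc.
Discriminative models: logistic regression, SVM, etc.
Discriminative models are said to perform better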
9.
Overview of a generative model (Naive Bayes)
P(y|x) ∝ P(x|y)P(y)
Both P(y) and P(x|y) can be computed easily by counting
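To make "computed by counting" concrete, here is a small sketch of a multinomial Naive Bayes classifier. The add-one smoothing and the choice to skip unseen words are assumptions added so the logarithms stay finite; the slide itself only states that counts are enough. The toy data is invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect the counts needed for P(y) and P(word | y)."""
    class_counts = Counter(labels)       # documents per class
    word_counts = defaultdict(Counter)   # word occurrences per class
    vocab = set()
    for words, y in zip(docs, labels):
        word_counts[y].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def log_posterior(words, y, model):
    """log P(y) + sum of log P(word | y), up to a constant."""
    class_counts, word_counts, vocab = model
    score = math.log(class_counts[y] / sum(class_counts.values()))
    total_words = sum(word_counts[y].values())
    for word in words:
        if word not in vocab:
            continue  # assumption: ignore words never seen in training
        # Assumption: add-one (Laplace) smoothing.
        score += math.log((word_counts[y][word] + 1) / (total_words + len(vocab)))
    return score


def classify(words, model):
    """Pick the class with the largest unnormalized log posterior."""
    return max(model[0], key=lambda y: log_posterior(words, y, model))


docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "noon"], ["lunch", "meeting"]]
labels = [1, 1, -1, -1]
model = train_naive_bayes(docs, labels)
print(classify(["cheap"], model))
```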
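10.
Overview of a discriminative model (SVM)
min. L(w) = Σ max(1 - y w · x, 0) + ||w||^2
Minimize the sum of a loss term and a regularization term
● Loss term: make fewer mistakes on the training data
● Regularization term: prevent overly complex models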
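The objective above can be evaluated and minimized directly. The sketch below uses plain subgradient descent with a fixed step size, which is an assumption chosen for brevity; practical SVM solvers use more refined optimizers and usually a tunable weight on the regularization term. The two-example dataset is hypothetical.

```python
def dot(w, x):
    """Inner product w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def objective(w, data):
    """L(w) = sum of hinge losses + squared L2 norm of w."""
    hinge = sum(max(1 - y * dot(w, x), 0) for x, y in data)
    return hinge + dot(w, w)


def subgradient(w, data):
    """One subgradient of L(w)."""
    grad = [2 * wi for wi in w]           # from the ||w||^2 term
    for x, y in data:
        if y * dot(w, x) < 1:             # hinge term is active
            for k, xk in enumerate(x):
                grad[k] -= y * xk
    return grad


def train(data, dim, steps=200, learning_rate=0.05):
    """Assumption: fixed-step subgradient descent starting from w = 0."""
    w = [0.0] * dim
    for _ in range(steps):
        g = subgradient(w, data)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


# Toy data: (x, y) pairs with y in {1, -1}.
data = [([1, 0], 1), ([0, 1], -1)]
w = train(data, dim=2)
print(objective([0.0, 0.0], data), objective(w, data))
```

At w = 0 every example contributes a hinge loss of 1, so the first printed value is the number of training examples.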
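11.
Are discriminative models really better?
● When you investigate the causes of misclassifications, the weights learned by a discriminative model are often hard to accept
o A word that appears 20 times in one class and only 2 times in the other gets a weight of only about 0.1
o Meanwhile, a word that appears about 1000 times in one class and about 2000 times in the other can end up with a weight of about 1.3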
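12.
What is happening?
● With regularization, the weights of low-frequency words inevitably tend to be small
● Yet weakening the regularization often makes performance worse
As a rule of thumb, rare words are often important, but regularization wipes out their weights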
13.
tf idf
tf idf: a measure of how important a word is
tf: term frequency
df: document frequency
idf: inverse document frequency
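The slide names the three quantities without giving a formula. As a sketch, assuming the common variant tf × log(N / df), where N is the number of documents (several other weightings are in use), the computation looks like this:

```python
import math
from collections import Counter


def document_frequency(docs):
    """df[w] = number of documents that contain word w at least once."""
    df = Counter()
    for words in docs:
        df.update(set(words))
    return df


def tf_idf(words, df, num_docs):
    """Assumed weighting: tf * log(N / df) for each word in one document."""
    tf = Counter(words)  # term frequency within this document
    return {
        word: count * math.log(num_docs / df[word])
        for word, count in tf.items()
        if df[word] > 0  # skip words with no document frequency
    }


# Hypothetical tokenized documents.
docs = [["a", "b", "a"], ["a", "c"], ["a", "d"]]
df = document_frequency(docs)
weights = tf_idf(docs[0], df, len(docs))
print(weights)
```

Here "a" occurs in every document, so its idf is log(1) = 0 and it gets zero weight even though its tf is the highest, while the rare word "b" gets a positive weight.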
14.
Document classification that takes idf into account
Tackling the Poor Assumptions of Naive Bayes Text Classifiers (Rennie et al., ICML2003)
● This is the Complement Naive Bayes paper introduced at the beginning
15.
Looking back at Complement Naive Bayes
Term Weighted CNB performs on par with SVM…
16.
[ASCII-art character saying:] "Then let's put idf into the SVM!"
* Strictly speaking, the method looks at the ratio of occurrence rates rather than idf, so the motivation is different
17.
Naive Bayes SVM (Wang, ACL2012)
Scale the weight of each input word by the class ratio in advance, then feed the result into an SVM
Wait, is such a simple method really good enough…?
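The preprocessing step can be sketched as follows. This follows my reading of the log-count ratio in the cited NBSVM paper: r = log((p / |p|) / (q / |q|)), where p and q are smoothed per-class feature counts. The smoothing value alpha = 1 and the toy data are assumptions, and the SVM training that follows the scaling is omitted.

```python
import math


def log_count_ratio(X, y, alpha=1.0):
    """r_k = log of (share of feature k in class +1) / (share in class -1)."""
    dim = len(X[0])
    p = [alpha] * dim  # smoothed counts for class +1
    q = [alpha] * dim  # smoothed counts for class -1
    for x, label in zip(X, y):
        target = p if label == 1 else q
        for k, xk in enumerate(x):
            target[k] += xk
    p_sum, q_sum = sum(p), sum(q)
    return [math.log((pk / p_sum) / (qk / q_sum)) for pk, qk in zip(p, q)]


def scale(x, r):
    """Elementwise product r * x; this is what would be fed to the SVM."""
    return [rk * xk for rk, xk in zip(r, x)]


# Hypothetical binary bag-of-words vectors and labels.
X = [[1, 0], [1, 0], [0, 1]]
y = [1, 1, -1]
r = log_count_ratio(X, y)
print(r, scale([1, 1], r))
```

Because the scaling depends only on how skewed a word is between the classes, a rare but strongly class-specific word enters the SVM with a large magnitude instead of relying on the learned weight alone.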
18.
Experimental results
Performance improves by 2 to 3% over a plain SVM
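19.
Impressions
● A little preprocessing before feeding the data into an SVM improves performance, which is a good deal
● No black magic is needed, so it makes a good baseline
● Another plus is that it is easy to understand what it is doing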
20.
Aside: the types in the equations feel off
Apparently MATLAB lets you add a real number to a vector
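21.
Outlook and expectations
● The current state of the art in document classification is paragraph vector, but it has a black-magic feel to it
● There may still be room to develop new methods that exploit the characteristics of document classification (high dimensionality, sparsity, and low-frequency features that are clearly important)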
22.
Various recent research (repeated)
● NBSVM (ACL2012)
● Volume Regularization (NIPS2012)
● Sentence Regularization (ICML2014)
● Paragraph Vector (ICML2014)