The contribution of_stylistic_information_to_content-based_mobile_spam_filtering

The Contribution of Stylistic
Information to Content-based
Mobile Spam Filtering
Dae-Neung Sohn, Jung-Tae Lee and Hae-Chang Rim
ACL 2009, @ paper reading group, presenter: Yoshiaki Kitagawa
* Figures and tables in these slides are taken from the paper

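Summary
• Task
  • Mobile spam detection
  • Detecting spam messages among SMS messages
• What is new compared with prior work?
  • Shows that stylistic information is useful in addition to the standard word and character n-gram features
• Key idea of the method
  • Four feature types that represent stylistic information
    • Length features: LEN
    • Function word frequencies: FW
    • Part-of-speech n-grams: POS
    • Special characters: SC
• How was effectiveness verified?
  • Binary classification of spam vs. non-spam messages
  • 1-AUC (apparently TREC provided a dataset and an evaluation toolkit, and the evaluation follows them)
  • ROC curves
• Points for discussion
  • Is stylistic information also useful for languages other than Korean?
  • The data distribution does not come from random sampling
  • I would have liked to see examples of false positives and false negatives (although they would be in Korean, which I cannot read)
  • It would be interesting if a deep language model could capture stylistic information
• Papers to read next
  • I would like to find a resource that summarizes spam-detection datasets and methods
  • I would also like to read papers on web spam detection
  • There were many papers on spam review detection
  • Mendenhall [1887] is so old that it caught my attention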

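Introduction
• Spam messages often contain words such as “loan” or “70% off sale”, but there is no guarantee that legitimate messages never contain such words
• So, rather than looking only at content words, the authors ask whether style information (stylistic information) can be used
• Assumptions:
  • There are two kinds of writers: spammers and non-spammers
  • Spammers have distinctive stylistic traits (for example, in how they express themselves)
  • An SMS message carries the fingerprint of its writer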

Stylistic Feature Set
• Length features: LEN
  • Byte length of the SMS message and average byte length of its words
• Function word frequencies: FW
  • As the name says, the frequencies of function words
• Part-of-speech n-grams: POS
  • As the name says, n-grams over part-of-speech tags (n = 1, 2, 3)
• Special characters: SC
  • The authors built a dictionary of 439 emoticons and 229 special patterns
  • Non-spammer: “:-)” (smiling), “T T” (crying)
  • Spammer: “$$$”, “%”

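The four feature types above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function-word list and the special-pattern list are tiny hypothetical stand-ins (the paper's dictionary has 439 emoticons and 229 special patterns), the POS tags are assumed to come from some external tagger, and UTF-8 is assumed for the byte lengths.

```python
from collections import Counter

# Hypothetical stand-ins for the paper's resources (assumptions for illustration).
FUNCTION_WORDS = {"the", "of", "to", "and", "a"}
SPECIAL_PATTERNS = [":-)", "T T", "$$$", "%"]


def stylistic_features(text, pos_tags, max_n=3):
    """Return a sparse feature dict (LEN, FW, POS, SC) for one message.

    pos_tags is assumed to be the output of an external POS tagger.
    """
    feats = Counter()
    words = text.split()

    # LEN: message length in bytes and average word length in bytes
    # (UTF-8 is an assumption; the slides do not state the encoding).
    feats["LEN:msg_bytes"] = len(text.encode("utf-8"))
    if words:
        total = sum(len(w.encode("utf-8")) for w in words)
        feats["LEN:avg_word_bytes"] = total / len(words)

    # FW: frequency of each function word.
    for w in words:
        lw = w.lower()
        if lw in FUNCTION_WORDS:
            feats["FW:" + lw] += 1

    # POS: n-grams over the tag sequence for n = 1..max_n.
    for n in range(1, max_n + 1):
        for i in range(len(pos_tags) - n + 1):
            feats["POS:" + "_".join(pos_tags[i:i + n])] += 1

    # SC: occurrences of emoticons and special patterns from the dictionary.
    for pat in SPECIAL_PATTERNS:
        count = text.count(pat)
        if count:
            feats["SC:" + pat] = count

    return dict(feats)


example = stylistic_features(
    "Win a loan $$$ today", ["VB", "DT", "NN", "SYM", "NN"]
)
```

Prefixing each key with its feature type (LEN, FW, POS, SC) keeps the groups separable, which makes it easy to switch one group on or off when comparing their contributions.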
Training the Mobile Spam Filter
• Method: maximum entropy model
• Parameter estimation: L-BFGS algorithm (a quasi-Newton method)
• Feature selection: information gain (IG)
memo
Maximum entropy model reference: https://takeda25.hatenablog.jp/entry/20121105/1352385394
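As a note on the feature-selection step, information gain for a binary feature can be computed as the class entropy minus the class entropy conditioned on whether the feature is present. The sketch below assumes binary presence/absence features and a simple top-k cut-off; the slides do not say how the authors thresholded, so `select_top_k` is a hypothetical helper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(has_feature, labels):
    """IG of one binary feature: H(C) - H(C | feature present/absent)."""
    present = [y for f, y in zip(has_feature, labels) if f]
    absent = [y for f, y in zip(has_feature, labels) if not f]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_top_k(feature_columns, labels, k):
    """Hypothetical helper: keep the k feature names with the highest IG."""
    ranked = sorted(
        feature_columns,
        key=lambda name: information_gain(feature_columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


labels = ["spam", "spam", "ham", "ham"]
columns = {
    "SC:$$$": [1, 1, 0, 0],   # perfectly separates the classes
    "FW:the": [1, 0, 1, 0],   # carries no class information
}
perfect = information_gain(columns["SC:$$$"], labels)
useless = information_gain(columns["FW:the"], labels)
best = select_top_k(columns, labels, k=1)
```

In this toy data the first feature gets the maximum gain of 1 bit and the second gets 0, so only the first survives selection.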

Data
• Korean SMS messages
  • 18,000 (60%) legitimate messages
  • 12,000 (40%) spam messages

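Evaluation Metrics
• 1-AUC: the lower, the better
  • Apparently TREC provided a dataset and an evaluation toolkit, and the evaluation follows them
• ROC curve
  • A ROC curve is usually drawn as follows:
    • TPR (true positive rate) = TP/(TP+FN) on the vertical axis
    • FPR (false positive rate) = FP/(FP+TN) on the horizontal axis
  • Note that this paper instead plots the ROC curve as follows, on a logit scale:
    • logit(FNR), where FNR (false negative rate) = FN/(TP+FN), on the vertical axis (inverted)
    • logit(FPR), where FPR (false positive rate) = FP/(FP+TN), on the horizontal axis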

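To make the metrics above concrete, here is a small sketch that computes 1-AUC from classifier scores and the logit-scaled FNR/FPR pair for one threshold. It does not reproduce the TREC toolkit; AUC is computed with the plain pairwise (Mann-Whitney) definition, and the scores and labels are made-up toy values (label 1 = spam).

```python
import math


def one_minus_auc(scores, labels):
    """1 - AUC, with AUC as the fraction of (spam, ham) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return 1.0 - wins / (len(pos) * len(neg))


def error_rates(scores, labels, threshold):
    """Return (FNR, FPR) when messages scoring >= threshold are called spam."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted_spam = s >= threshold
        if y == 1 and predicted_spam:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_spam:
            fp += 1
        else:
            tn += 1
    return fn / (tp + fn), fp / (fp + tn)


def logit(p):
    """log(p / (1 - p)); only defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Toy scores and labels (assumed values, not from the paper).
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]

score_1_minus_auc = one_minus_auc(scores, labels)
fnr, fpr = error_rates(scores, labels, threshold=0.35)
point_on_logit_roc = (logit(fpr), logit(fnr))
```

Because logit is undefined at 0 and 1, thresholds that give a perfect FNR or FPR cannot be plotted on the logit axes; the logit scale mainly spreads out the low-error region where good filters differ.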
Experimental Results 1
• Baseline: word and character n-grams
• Proposed: the four stylistic feature types
• Combine: Baseline + Proposed

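For reference, the Baseline and Combine settings can be pictured as below. This is a sketch under assumptions: the n-gram orders (`word_n`, `char_n`) are placeholders because the slides do not give them, whitespace tokenization is assumed, and the stylistic features are passed in as an already-computed dict.

```python
from collections import Counter


def word_ngrams(text, n):
    """Word n-grams over whitespace tokens (tokenization is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Character n-grams over the raw message string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def baseline_features(text, word_n=(1, 2), char_n=(2, 3)):
    """Baseline: counts of word and character n-grams (orders are placeholders)."""
    feats = Counter()
    for n in word_n:
        for gram in word_ngrams(text, n):
            feats["W:" + gram] += 1
    for n in char_n:
        for gram in char_ngrams(text, n):
            feats["C:" + gram] += 1
    return dict(feats)


def combine(baseline, stylistic):
    """Combine: union of the two feature dicts (key prefixes keep them apart)."""
    merged = dict(baseline)
    merged.update(stylistic)
    return merged


base = baseline_features("70% off sale")
combined = combine(base, {"SC:%": 1})
```

Since the content features and the stylistic features use different key prefixes, merging them is just a dictionary union, and the same classifier can be trained on any of the three settings.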
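Experimental Results 2
• Stylistic information lowers the false negative (fn) rate without changing the false positive (fp) rate
• In other words, it raises recall without increasing false detections (my interpretation)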

Experimental Results 3
• LEN, FW, POS, and SC are the stylistic feature types described earlier
• Surprisingly, POS does not contribute much (the authors say the same). SC does seem to help quite a bit

References
• Paper: https://www.aclweb.org/anthology/P09-2081.pdf
