Camouflaged Chinese Spam Content Detection with Semi-supervised Generative Active Learning
Zhuoren Jiang, Zhe Gao, Yu Duan, Yangyang Kang, Changlong Sun, Qiong Zhang, Xiaozhong Liu
ACL 2020, paper reading group, presented by Yoshiaki Kitagawa
Note: the materials shown in these slides are taken from the paper.

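Summary
 Task
   Mobile spam detection
   Detect the spam messages contained in SMS messages
 What is better than prior work?
   Sensitive to spam messages
   No comparison with labeled data is needed when searching for new samples, so the cost is O(N)
   Can handle characteristics of the Chinese language
 Key techniques
   Methods that help with the data-hungry problem
   Self-Diversity Based Active Learning
   S-VAE with Masked Attention Learning
 How was it validated?
   Binary classification of spam vs. non-spam messages
   Compared the number of spam messages acquired and the accuracy across active learning iterations
 Points for discussion
   A human is in the loop during the iterations, so can the comparison really be exact?
 Papers to read next
   The original active learning paper [Cohn et al., 1996]
   The original S-VAE paper [Kingma et al., 2014]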

Introduction
 Labeling data for machine learning is expensive
   "tedious, laborious, and time consuming task for humans"
 Active learning aims to reach high performance at low annotation cost, but spam message detection raises the following challenges
 Challenges:
   Imbalance: the share of spam is very low ("much less than 1% of SMS messages were spam")
   Efficiency: comparing unlabeled data against labeled data costs O(N^2)
   Camouflage: spammers exploit characters that look or sound alike

SIGNAL (Semi-supervised Generative Active Learning) Model

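Self-Diversity Based Active Learning
 Introduces a metric, SDi, that measures whether a sample is worth annotating
 p is the prediction of the current classifier
 (The formula cannot be written out properly on this slide; see the paper for details)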
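As a rough illustration of why a self-diversity style score needs no comparison with labeled data, the sketch below scores each unlabeled message only against its own generated variants. The scoring rule (mean absolute gap between the classifier's prediction p on the message and on each variant) is a stand-in chosen for illustration and is not the SDi formula from the paper. `toy_predict` and `toy_variants` are hypothetical placeholders for the current classifier and the generative augmentation.

```python
from typing import Callable, List, Tuple


def self_diversity_score(
    text: str,
    predict: Callable[[str], float],
    make_variants: Callable[[str], List[str]],
) -> float:
    """Stand-in score: mean |p(text) - p(variant)| over the text's own variants.

    NOT the paper's SDi formula. It only illustrates that the score depends on
    the sample and its variants, never on the labeled set.
    """
    p = predict(text)
    variants = make_variants(text)
    if not variants:
        return 0.0
    return sum(abs(p - predict(v)) for v in variants) / len(variants)


def rank_pool(pool, predict, make_variants) -> List[Tuple[float, str]]:
    """Score every unlabeled text once (N scoring calls), then sort, highest first."""
    scored = [(self_diversity_score(t, predict, make_variants), t) for t in pool]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy classifier: spam probability rises with the number of "$" signs.
def toy_predict(text: str) -> float:
    return min(1.0, 0.25 * text.count("$"))


# Hypothetical toy generator: one variant drops "$", the other doubles it.
def toy_variants(text: str) -> List[str]:
    return [text.replace("$", ""), text.replace("$", "$$")]


ranking = rank_pool(["hello", "win $ now"], toy_predict, toy_variants)
```

In this toy run the message whose prediction changes under its own variants is ranked first, which is the kind of sample an annotator would be asked to label.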

S-VAE with Masked Attention Learning
 Semi-supervised Variational AutoEncoder (S-VAE) (Kingma et al., 2014)
 Used to generate similar texts
 The S-VAE is applied with stochastic masking
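To make "stochastic masking" concrete, here is a minimal sketch that replaces each character with a mask token with a fixed probability, as a preprocessing step before a generator would see the text. The independent per-character coin flip, the `mask_prob` parameter, and the `[MASK]` token are all assumptions for illustration. The paper's masked attention mechanism is not reproduced here, and no S-VAE is implemented.

```python
import random
from typing import List

MASK = "[MASK]"  # hypothetical mask token


def mask_characters(text: str, mask_prob: float, rng: random.Random) -> List[str]:
    """Replace each character by MASK independently with probability mask_prob."""
    if not 0.0 <= mask_prob <= 1.0:
        raise ValueError("mask_prob must be in [0, 1]")
    return [MASK if rng.random() < mask_prob else ch for ch in text]


rng = random.Random(0)  # fixed seed so the run is repeatable
masked = mask_characters("加微信领奖", mask_prob=0.3, rng=rng)
```

Each call with a fresh random state yields a different masked view of the same message, which is what lets a generator produce several similar but not identical texts from one input.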

Character Variation Graph-enhanced Augmentation
 Augments the texts generated by the S-VAE
 Uses the Chinese character variation graph G (Jiang et al., 2019a); as far as I can tell, the augmentation follows edges of the graph with a random walk
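The random-walk idea can be sketched as follows. This is only a toy under stated assumptions: `TOY_GRAPH` is a hypothetical hand-made graph with a few similar-sounding characters, the walk picks neighbors uniformly, and every character found in the graph is replaced. The real graph G and the paper's sampling strategy may differ.

```python
import random
from typing import Dict, List

# Hypothetical toy variation graph: each character maps to characters that look
# or sound similar. The real graph G comes from Jiang et al. (2019a).
TOY_GRAPH: Dict[str, List[str]] = {
    "微": ["薇", "威"],
    "薇": ["微"],
    "威": ["微"],
    "信": ["芯"],
    "芯": ["信"],
}


def random_walk(start: str, steps: int, graph: Dict[str, List[str]], rng: random.Random) -> str:
    """Follow up to `steps` random edges from `start`; stop early at a dead end."""
    current = start
    for _ in range(steps):
        neighbors = graph.get(current, [])
        if not neighbors:
            break
        current = rng.choice(neighbors)
    return current


def augment(text: str, graph: Dict[str, List[str]], rng: random.Random, steps: int = 1) -> str:
    """Replace every character that appears in the graph by the end point of a random walk."""
    return "".join(random_walk(ch, steps, graph, rng) if ch in graph else ch for ch in text)


rng = random.Random(0)
variant = augment("加微信", TOY_GRAPH, rng)  # "加" is not in the graph and stays unchanged
```

With one step per character, "微信" turns into a camouflaged spelling built from its graph neighbors, which mimics how spammers disguise keywords.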

Data & Evaluation
 Chinese SMS dataset:
   48,896 testing samples
   23,891 spam samples
   25,005 normal samples
 200 samples are randomly drawn as the initial labeled set, then the active learning iterations are run and evaluated
 Evaluation: the number of spam messages acquired after 10 iterations, and the accuracy
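The evaluation protocol above can be sketched as a small loop. The batch size, the toy pool, and `toy_score` are assumptions for illustration (the slides do not state how many samples are annotated per iteration), and the sketch tracks only the number of spam samples acquired, not the accuracy.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, int]  # (text, label) with label 1 = spam, 0 = normal


def run_active_learning(
    pool: List[Sample],
    score: Callable[[str], float],
    n_initial: int,
    n_iterations: int,
    batch_size: int,
    rng: random.Random,
) -> List[int]:
    """Return the cumulative number of spam samples acquired after each iteration.

    Labels in `pool` stand in for the human annotator: a label is only read
    once a sample has been selected.
    """
    order = list(range(len(pool)))
    rng.shuffle(order)
    labeled = set(order[:n_initial])  # random initial labeled set
    spam_acquired = 0
    history = []
    for _ in range(n_iterations):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        # Pick the highest-scoring unlabeled samples for annotation.
        batch = sorted(unlabeled, key=lambda i: score(pool[i][0]), reverse=True)[:batch_size]
        labeled.update(batch)
        spam_acquired += sum(pool[i][1] for i in batch)
        # A real run would retrain the classifier (and refresh `score`) here.
        history.append(spam_acquired)
    return history


# Hypothetical toy pool with 1% spam; not the dataset from the paper.
pool = [(f"free $ prize {i}", 1) for i in range(10)] + [(f"see you at {i}", 0) for i in range(990)]
toy_score = lambda text: float(text.count("$"))  # hypothetical acquisition score
history = run_active_learning(pool, toy_score, n_initial=200, n_iterations=10,
                              batch_size=5, rng=random.Random(0))
```

Plotting `history` for each acquisition strategy gives the kind of "spam acquired per iteration" comparison used in the evaluation.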

Experimental Results
 Baseline:
   Uncertainty [Lewis and Gale, 1994]
   Margin [Roth and Small, 2006]
   Entropy [Li and Guo, 2013]
 A: number of spam samples acquired after 10 iterations
 B, C, D: accuracy as a function of the number of spam samples acquired
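For reference, the three baseline strategies are usually defined on the classifier's predicted probability. The sketch below gives the common textbook forms for a binary spam classifier; the exact variants used in the paper's experiments may differ, and the `predictions` values are hypothetical.

```python
import math


def least_confidence(p_spam: float) -> float:
    """Uncertainty sampling: 1 minus the probability of the most likely class."""
    return 1.0 - max(p_spam, 1.0 - p_spam)


def margin(p_spam: float) -> float:
    """Gap between the two class probabilities; a SMALLER margin means more uncertain."""
    return abs(p_spam - (1.0 - p_spam))


def entropy(p_spam: float) -> float:
    """Shannon entropy (in nats) of the binary prediction; 0 * log 0 is treated as 0."""
    total = 0.0
    for p in (p_spam, 1.0 - p_spam):
        if p > 0.0:
            total -= p * math.log(p)
    return total


predictions = {"msg_a": 0.5, "msg_b": 0.9, "msg_c": 0.02}  # hypothetical classifier outputs
most_uncertain = max(predictions, key=lambda k: entropy(predictions[k]))
```

All three favor samples near p = 0.5. Under heavy class imbalance such samples are not necessarily spam, which is one motivation for a spam-sensitive criterion.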

Experimental Results 2
 Stylistic information improves the FN rate without changing the FP rate
 That is, it raises recall without changing the false positive rate (presenter's interpretation)
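The link between FN rate and recall in the interpretation above follows directly from the definitions, as this small sketch shows. Spam is treated as the positive class, and the counts are hypothetical numbers chosen for illustration, not results from the paper.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """FP rate, FN rate and recall from confusion-matrix counts (spam = positive)."""
    return {
        "fp_rate": fp / (fp + tn),  # normal messages flagged as spam
        "fn_rate": fn / (fn + tp),  # spam messages that were missed
        "recall": tp / (tp + fn),   # equals 1 - fn_rate
    }


# Hypothetical counts, not results from the paper.
before = rates(tp=80, fp=5, tn=95, fn=20)
after = rates(tp=90, fp=5, tn=95, fn=10)
```

Here the FP rate is identical in both cases while the FN rate halves, so recall rises from 0.8 to 0.9.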

References
 Paper: https://www.aclweb.org/anthology/2020.acl-main.279.pdf
