Topic Models Based Personalized Spam Filter


Spam filtering poses a critical problem in text categorization because the features of text are continuously changing. Spam evolves continuously, which makes it difficult for a filter to classify the evolving and evasive new feature patterns. Since most practical applications are based on online user feedback, the task calls for fast, incremental, and robust learning algorithms. This paper presents a system for the automatic detection and filtering of unsolicited electronic messages. We have developed a content-based classifier that uses two topic models, LSI and PLSA, complemented by a text-pattern-matching based natural language approach. By combining these statistical and NLP techniques we obtain a parallel content-based spam filter that performs filtration in two stages: in the first stage each model generates its individual prediction, and in the second stage the predictions are combined by a voting mechanism.


1. Topic Models Based Personalized Spam Filter
   Sudarsun S., Director – R&D, Checktronix India Pvt Ltd, Chennai
   Venkatesh Prabhu G., Research Associate, Checktronix India Pvt Ltd, Chennai
   Valarmathi B., Professor, SKP Engineering College, Thiruvannamalai
2. What is Spam?
   - Unsolicited, unwanted email
   What is Spam Filtering?
   - Detection/filtering of unsolicited content
   What is Personalized Spam Filtering?
   - The definition of "unsolicited" becomes personal
   Approaches
   - Origin-Based Filtering [generic]
   - Content-Based Filtering [personalized]
3. Content-Based Filtering
   What does the message contain?
   - Images, text, URLs
   Is it "irrelevant" to my preferences?
   - How to define relevancy?
   - How does the system understand relevancy?
     - Supervised learning: teach the system what I like and what I don't
     - Unsupervised learning: decisions made using latent patterns
4. Content-Based Filtering -- Methods
   Bayesian spam filtering (see the sketch after this list)
   - Simplest design / low computation cost
   - Based on keyword distribution
   - Cannot work on contexts
   - Accuracy is around 60%
   Topic-model based text mining
   - Based on the distribution of n-grams (key phrases)
   - Addresses synonymy and polysemy
   - Low run-time computation cost
   - Unsupervised technique
   Rule-based filtering
   - Supervised technique based on hand-written rules
   - Best accuracy for known cases
   - Cannot adapt to new patterns
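Bayesian filtering is the baseline the deck compares against. A minimal sketch, assuming a toy corpus, whitespace tokenization, and Laplace smoothing; it classifies purely from keyword distributions, which is why it cannot capture context:

    import math
    from collections import Counter

    def train(docs, labels):
        counts = {"spam": Counter(), "ham": Counter()}
        priors = Counter(labels)
        for doc, label in zip(docs, labels):
            counts[label].update(doc.lower().split())  # per-class keyword distribution
        return counts, priors

    def classify(doc, counts, priors):
        vocab = set().union(*counts.values())
        total_docs = sum(priors.values())
        scores = {}
        for label, words in counts.items():
            total = sum(words.values())
            score = math.log(priors[label] / total_docs)
            for w in doc.lower().split():
                # Laplace-smoothed per-word likelihood
                score += math.log((words[w] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    docs = ["buy cheap pills now", "meeting agenda for monday", "cheap offer buy now"]
    labels = ["spam", "ham", "spam"]
    counts, priors = train(docs, labels)
    print(classify("cheap pills offer", counts, priors))  # -> spam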
5. Topic Models
   - Treat every word as a feature
   - Represent the corpus as a higher-dimensional distribution
   - SVD: decomposes the higher-dimensional data into a small reduced sub-space containing only the dominant feature vectors
   - PLSA: documents can be understood as a mixture of topics
   Rule-Based Approaches
   - N-grams: a language-model approach
   - The more n-grams two texts share, the closer their patterns are
6. LSA Model, In Brief
   - Describes the underlying structure among texts.
   - Computes similarities between texts.
   - Represents documents in a high-dimensional semantic space (the term-document matrix).
   - The high-dimensional space is approximated by a low-dimensional space using Singular Value Decomposition (SVD).
   - SVD decomposes the high-dimensional TDM into U, S, V matrices (see the sketch below):
     - U: left singular vectors (reduced word vectors)
     - V: right singular vectors (reduced document vectors)
     - S: array of singular values (variances or scaling factors)
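A minimal sketch of the SVD step on a toy term-document matrix; the corpus, the rank R = 2, and the folding-in of a new document are illustrative assumptions:

    import numpy as np

    docs = ["cheap pills cheap offer", "project meeting notes", "cheap offer now"]
    vocab = sorted({w for d in docs for w in d.split()})
    # term-document matrix: rows are terms (M), columns are documents
    tdm = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

    U, S, Vt = np.linalg.svd(tdm, full_matrices=False)
    R = 2                                        # rank of the reduced sub-space
    U_r, S_r, Vt_r = U[:, :R], S[:R], Vt[:R, :]  # reduced word / document vectors

    # Fold a new document into the R-dimensional semantic space
    q = np.array(["cheap offer now".split().count(w) for w in vocab], dtype=float)
    q_r = q @ U_r / S_r                          # 1xR pseudo-document vector
    print(q_r)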
7. PLSA Model
   - In the PLSA model, a document is a mixture of topics and topics generate words.
   - The probabilistic latent factor model can be described as the following generative model:
     - Select a document d_i from D with probability Pr(d_i).
     - Pick a latent factor z_k with probability Pr(z_k | d_i).
     - Generate a word w_j from W with probability Pr(w_j | z_k).
   - The resulting joint model is Pr(d_i, w_j) = Pr(d_i) · Σ_k Pr(w_j | z_k) Pr(z_k | d_i).
   - The aspect-model parameters are computed using the EM algorithm (see the sketch below).
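A minimal sketch of the EM iteration for the aspect model, assuming a toy word-document count matrix and Z = 2 aspects; the E-step computes Pr(z | d, w) and the M-step re-estimates Pr(w | z) and Pr(z | d) from the expected counts:

    import numpy as np

    rng = np.random.default_rng(0)
    n = np.array([[2, 0, 1], [0, 3, 1], [1, 1, 0]], dtype=float)  # docs x words counts
    D, W, Z = n.shape[0], n.shape[1], 2

    p_z_d = rng.random((D, Z)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # Pr(z|d)
    p_w_z = rng.random((Z, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # Pr(w|z)

    for _ in range(50):
        # E-step: Pr(z | d, w), normalized over z
        p_z_dw = p_z_d[:, :, None] * p_w_z[None, :, :]  # D x Z x W
        p_z_dw /= p_z_dw.sum(axis=1, keepdims=True)
        # M-step: re-estimate from expected counts n(d, w) * Pr(z | d, w)
        nz = n[:, None, :] * p_z_dw                     # D x Z x W
        p_w_z = nz.sum(axis=0); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = nz.sum(axis=2); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    print(np.round(p_w_z, 3))  # learned word distributions per aspect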
8. N-Gram Approach
   - A language-model approach.
   - Looks for repeated patterns.
   - Each word depends probabilistically on the n-1 preceding words.
   - Works by calculating and comparing n-gram profiles (see the sketch below).
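A minimal sketch of n-gram profile construction and comparison, assuming whitespace tokenization and a simple set-overlap similarity; the deck does not prescribe a particular profile distance, so the overlap measure is an illustrative choice:

    from collections import Counter

    def ngram_profile(text, n=2, top=20):
        """Return the `top` most frequent word n-grams of a text."""
        words = text.lower().split()
        grams = zip(*[words[i:] for i in range(n)])
        return [g for g, _ in Counter(grams).most_common(top)]

    def similarity(p1, p2):
        """Jaccard overlap between two n-gram profiles."""
        return len(set(p1) & set(p2)) / max(len(set(p1) | set(p2)), 1)

    spam_profile = ngram_profile("buy now cheap pills buy now limited offer")
    mail_profile = ngram_profile("limited offer buy now while stocks last")
    print(similarity(spam_profile, mail_profile))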
9. Overall System Architecture
   Training mails feed a preprocessor, whose output drives the LSA model, the PLSA model, the N-gram model, and optionally other classifiers in parallel; a combiner merges their individual predictions into the final result. Test mails follow the same path.
10. Preprocessing (a sketch of this pipeline follows)
   Feature extraction
   - Tokenizing
   Feature selection
   - Pruning
   - Stemming
   - Weighting
   Feature representation
   - Term-document matrix generation
   Sub-spacing
   - LSA / PLSA model projection
   Feature reduction
   - Principal Component Analysis
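A minimal sketch of the preprocessing steps up to the term-document matrix; the stopword list, the crude suffix stripping (standing in for a real stemmer such as Porter's), and the tf-idf weighting are illustrative assumptions:

    import math

    STOP = {"the", "a", "is", "to", "and"}

    def preprocess(doc):
        tokens = [t.strip(".,!?").lower() for t in doc.split()]    # tokenizing
        tokens = [t for t in tokens if t and t not in STOP]        # pruning
        return [t[:-1] if t.endswith("s") else t for t in tokens]  # naive stemming

    docs = ["Cheap pills and cheap offers", "The meeting is moved to Monday"]
    toks = [preprocess(d) for d in docs]
    vocab = sorted({t for ts in toks for t in ts})

    # tf-idf weighted term-document matrix (terms x documents)
    tdm = [[ts.count(w) * math.log(len(toks) / sum(w in t for t in toks))
            for ts in toks] for w in vocab]
    print(vocab); print(tdm)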
11. Principal Component Analysis (PCA)
   Data reduction: ignore the features of lesser significance (see the sketch below).
   - Given N data vectors in k dimensions, find c <= k orthogonal vectors that best represent the data.
   - The original data set is reduced to N data vectors on c principal components (reduced dimensions).
   Detects structure in the relationships between variables, which is then used to classify the data.
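A minimal sketch of PCA via eigendecomposition of the covariance matrix; N, k, c, and the random data are illustrative:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.random((100, 10))        # N=100 data vectors in k=10 dimensions
    c = 3                            # keep c <= k principal components

    Xc = X - X.mean(axis=0)          # center each feature
    cov = np.cov(Xc, rowvar=False)   # k x k covariance matrix
    vals, vecs = np.linalg.eigh(cov) # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:c]]  # c dominant eigenvectors
    X_reduced = Xc @ top             # N x c reduced representation
    print(X_reduced.shape)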
12. LSA Classification
   Input mails are tokenized into a token list and projected through the LSA model (an MxR matrix; M: vocabulary size, R: rank) into a 1xR vector. PCA (RxR'; R: input variable count, R': output variable count) reduces this to a 1xR' vector, which the BPN turns into a classification score.
13. PLSA Classification
   Input mails are tokenized into a token list and projected through the PLSA model (an MxZ matrix; M: vocabulary size, Z: aspect count) into a 1xZ vector. PCA (ZxZ'; Z: input variable count, Z': output variable count) reduces this to a 1xZ' vector, which the BPN turns into a classification score.
14. (P)LSA Classification
   Model training
   - Build the global (P)LSA model using the training mails.
   - Vectorize the training mails using the LSI/PLSA model.
   - Reduce the dimensionality of the matrix of pseudo-vectors of the training documents using PCA.
   - Feed the reduced matrix into the neural network for learning.
   Model testing
   - Each test mail is fed to the (P)LSA model for vectorization.
   - The vector is reduced using the PCA model.
   - The reduced vector is fed into the BPN neural network.
   - The BPN network emits its prediction with a confidence score.
   (An end-to-end sketch of this pipeline follows.)
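A minimal end-to-end sketch using scikit-learn stand-ins: TruncatedSVD plays the role of the LSA projection and MLPClassifier the role of the BPN. The corpus and all hyper-parameters are illustrative assumptions, not the authors' configuration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD, PCA
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    train_mails = ["cheap pills buy now", "meeting notes attached",
                   "limited offer cheap pills", "lunch on friday?"]
    labels = ["spam", "ham", "spam", "ham"]

    clf = make_pipeline(
        CountVectorizer(),             # token list -> term vectors
        TruncatedSVD(n_components=2),  # LSA-style projection to a 1xR vector
        PCA(n_components=2),           # feature reduction
        MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0),
    )
    clf.fit(train_mails, labels)
    print(clf.predict(["cheap offer now"]))        # prediction
    print(clf.predict_proba(["cheap offer now"]))  # confidence score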
15. N-Gram Method
   - Construct an n-gram tree from the training docs.
   - Documents form the leaves.
   - Internal nodes are the n-grams identified in the docs.
   - Weight of an n-gram = number of children.
   - A higher-order n-gram implies more weight.
   - Spam bias: Wt <- Wt * S / (S + L), where
     - P: total number of docs sharing an n-gram
     - S: number of spam docs sharing the n-gram
     - L: P - S (so S / (S + L) = S / P)
   (A worked example follows.)
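A worked sketch of the update rule; the helper name ngram_weight is hypothetical, and the numbers are illustrative:

    def ngram_weight(children, spam_docs, total_docs):
        """Scale an n-gram's base weight (child count) by its spam fraction S/(S+L)."""
        ham_docs = total_docs - spam_docs        # L = P - S
        return children * spam_docs / (spam_docs + ham_docs)

    # An n-gram with 4 children, shared by P=5 docs of which S=4 are spam:
    print(ngram_weight(children=4, spam_docs=4, total_docs=5))  # -> 3.2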
16. An Example N-Gram Tree
   [Figure: a tree whose leaves are documents T1-T5 and whose internal nodes N1-N4 are shared n-grams, annotated with their orders (1st, 2nd, 3rd).]
17. Combiner
   Mixture of experts (see the sketch below):
   - Get predictions from all the experts.
   - Use the most common prediction.
   - Use the prediction with the maximum confidence score.
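A minimal sketch of one way to combine the two rules above: majority vote first, maximum confidence as the tie-breaker. The (label, confidence) pairs are illustrative:

    from collections import Counter

    def combine(predictions):
        """predictions: list of (label, confidence) pairs, one per expert."""
        votes = Counter(label for label, _ in predictions)
        (top, n), *rest = votes.most_common()
        if not rest or n > rest[0][1]:
            return top                                    # majority decision
        return max(predictions, key=lambda p: p[1])[0]    # tie: max confidence

    experts = [("spam", 0.92), ("spam", 0.71), ("ham", 0.88)]
    print(combine(experts))  # -> spam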
18. Conclusion
   - The objective is to filter mail messages based on the preferences of an individual.
   - Classification performance increases with increased (incremental) training.
   - Initial learning is not necessary for LSA, PLSA & N-Gram.
   - Performs unsupervised filtering.
   - Prediction is fast, although background training is a relatively slow process.
20. Any queries? You can post your queries to [email_address]
