Spam Clustering


Published in: Technology, News & Politics

  1. 1. Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN: Spam Clustering using Wave Oriented K-Means
  2. 2. You’ll be hearing quite a lot about… <ul><li>Spam signatures </li></ul><ul><ul><li>Previous approaches </li></ul></ul><ul><ul><li>Spam Features </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>K-Means </li></ul></ul><ul><ul><li>K-Medoids </li></ul></ul><ul><ul><li>Stream clustering </li></ul></ul><ul><li>Constraints </li></ul>
  5. 5. And we’ll connect the dots
  6. 6. But the essence is… <ul><li>"A nation that forgets its past is doomed to repeat it." </li></ul><ul><li>Winston Churchill </li></ul>
  7. 7. And finally some result charts
  8. 8. Spam signatures <ul><li>Strong relation with dentistry </li></ul><ul><li>Necessary evil? </li></ul><ul><li>Last resort </li></ul>
  9. 9. Spam signatures (2) <ul><li>The most annoying problem is that they are labor-intensive </li></ul><ul><li>An extension of filtering email by hand </li></ul><ul><li>More automation is badly needed to make signatures work </li></ul>
  10. 10. Spam features <ul><li>The ki of the spam business </li></ul><ul><li>Its DNA </li></ul><ul><li>Everything and yet nothing </li></ul><ul><li>Anything that has a constant value in a given spam wave </li></ul>
  11. 11. Email Layout <ul><li>We noticed that although spammers change almost everything in an email to conceal that it is spam, they tend to preserve a certain layout. </li></ul><ul><li>We encoded the layout of a message as a string of tokens such as 141L2211. </li></ul><ul><li>This later evolved into a message summary such as BWWWLWWNWWE. </li></ul><ul><li>To this day, message layout is the most effective feature. </li></ul><ul><li>We also use variations of this feature for the MIME parts, the paragraph contents and so on. </li></ul>
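
The slides do not say what the letters in a summary such as BWWWLWWNWWE stand for, so the following is only a minimal sketch of the idea with a made-up alphabet: B and E mark the beginning and end of the message, W a line of ordinary words, L a line containing a link, and N a blank line. All of these meanings, and the function name `layout_summary`, are assumptions for illustration.

```python
import re


def layout_summary(body):
    """Encode the line-by-line layout of a message body as a token string.

    The token alphabet is hypothetical; the original slides do not define it.
    """
    tokens = ["B"]                      # assumed: beginning of message
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            tokens.append("N")          # assumed: blank line
        elif re.search(r"https?://", stripped):
            tokens.append("L")          # assumed: line containing a link
        else:
            tokens.append("W")          # assumed: line of ordinary words
    tokens.append("E")                  # assumed: end of message
    return "".join(tokens)
```

Two messages with entirely different wording but the same arrangement of text lines, link lines and blank lines get the same summary, which is what makes the layout usable as a constant feature of a spam wave.
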
  12. 12. Other Spam Features – headers <ul><li>Subject length, the number of separators, the maximum length of any word </li></ul><ul><li>The number of Received fields (turned out we were drunk and high when we chose this one) </li></ul><ul><li>Whether it had a name in the From field </li></ul><ul><li>A quite nice example is the stripped date format </li></ul><ul><ul><li>Take the Date field </li></ul></ul><ul><ul><li>Strip it of all alphanumeric characters </li></ul></ul><ul><ul><li>Store what’s left </li></ul></ul><ul><ul><li>“ ,    :: - ()” or “,    :: +” or “,    :: + ” </li></ul></ul><ul><li>Any more suggestions? </li></ul>
  13. 13. Other Spam Features – body <ul><li>Its length; the number of lines; whether it has long paragraphs or not; the number of consecutive blank lines; </li></ul><ul><ul><li>Basically any part of the email layout that we felt was more important than the average </li></ul></ul><ul><li>The number of links/email addresses/phone numbers </li></ul><ul><li>Bayes poison </li></ul><ul><li>Attachments </li></ul><ul><li>Etc. </li></ul>
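
The stripped date format is simple enough to sketch directly. The function names below are hypothetical, and what counts as a "separator" in a subject is an assumption (here: any character that is neither alphanumeric nor whitespace).

```python
import re


def stripped_date_format(date_header):
    """Remove every ASCII letter and digit from a Date header, keep the rest."""
    return re.sub(r"[A-Za-z0-9]", "", date_header)


def subject_features(subject):
    """Compute the three subject features named on the slide."""
    words = subject.split()
    return {
        "length": len(subject),
        # Assumption: a separator is any non-alphanumeric, non-space character.
        "separators": sum(
            1 for ch in subject if not ch.isalnum() and not ch.isspace()
        ),
        "max_word_length": max((len(w) for w in words), default=0),
    }
```

For a header such as `Mon, 12 Jan 2009 10:11:12 -0800 (PST)` only the punctuation and spacing skeleton survives, so mailers that format dates differently leave different fingerprints.
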
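
A few of the body features listed above can be sketched in the same way. The regular expressions for links and email addresses are deliberately crude stand-ins, not the patterns the authors used, and `body_features` is a hypothetical name.

```python
import re


def body_features(body):
    """Compute some layout-style features of a message body."""
    lines = body.splitlines()
    longest_blank_run = run = 0
    for line in lines:
        # Extend the current run on a blank line, reset it otherwise.
        run = run + 1 if not line.strip() else 0
        longest_blank_run = max(longest_blank_run, run)
    return {
        "length": len(body),
        "lines": len(lines),
        "max_consecutive_blank_lines": longest_blank_run,
        # Simplified patterns, assumed for illustration only.
        "links": len(re.findall(r"https?://\S+", body)),
        "email_addresses": len(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", body)),
    }
```
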
  14. 14. Combining features (1) <ul><li>One stick is easy to break </li></ul><ul><li>The Roman fasces symbolized power and authority </li></ul><ul><li>The symbol of strength through unity from the Roman Empire to the U.S. </li></ul><ul><li>The most obvious problem – our sticks are different. </li></ul><ul><ul><li>Strings, integers, bools </li></ul></ul><ul><ul><li>I’ll stress this later </li></ul></ul>fasces lictoriae (bundles of the lictors)
  15. 15. Combining features (2) <ul><li>If it’s an A and at the same time a B then it’s spam </li></ul><ul><li>The idea of combining features never died out </li></ul><ul><li>Started with its relaxed form – adding scores </li></ul><ul><ul><li>if it has “Viagra” in it – increase its spam score by 10%. </li></ul></ul><ul><li>Evolution came naturally </li></ul><ul><li>National Guard Bureau insignia </li></ul>
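
The difference between the strict form ("an A and at the same time a B") and the relaxed, additive form can be sketched as follows. The keyword list and weights are invented for illustration; only the "Viagra adds 10%" rule comes from the slide.

```python
def strict_rule(has_a, has_b):
    """Strict combination: spam only if both features are present."""
    return has_a and has_b


def additive_score(text, weights):
    """Relaxed combination: every matching keyword adds its weight."""
    lowered = text.lower()
    return sum(w for keyword, w in weights.items() if keyword in lowered)


# Hypothetical weights; 0.10 mirrors the "increase by 10%" example.
WEIGHTS = {"viagra": 0.10, "cheap": 0.05}
```

With these weights, a message containing both keywords scores about 0.15, while the strict rule would only ever answer yes or no.
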
  16. 16. Why cluster spam? <ul><li>A “well, duh” kind of slide </li></ul><ul><li>To extract the patterns we want </li></ul><ul><ul><li>How do we combine spam traits to get a reliable spam pattern? </li></ul></ul><ul><ul><li>And which are the traits that matter most? </li></ul></ul><ul><li>Agglomerative clustering is just one of many options </li></ul><ul><ul><li>Neural Networks </li></ul></ul><ul><ul><li>ARTMap worked beautifully on separating ham from spam </li></ul></ul>
  17. 17. So why agglomerative? <ul><li>Because the problem stated before is wrong </li></ul><ul><li>We don’t just want spam patterns. </li></ul><ul><ul><li>We want patterns for that spam wave alone </li></ul></ul><ul><li>Most neural nets make a binary decision. We want a plurality of classes. </li></ul><ul><li>Still there are other options, like SVMs. </li></ul><ul><ul><li>They don’t handle clustering strings well </li></ul></ul><ul><ul><li>We want something that accepts just about any feature as long as you can compute a distance </li></ul></ul>
  18. 18. K-means and K-medoids <ul><li>So we chose the simplest of methods – the widely popular K-Means </li></ul><ul><ul><li>In a given feature space each item to be classified is a point. </li></ul></ul><ul><ul><li>The distance between the points indicates the resemblance of the original items. </li></ul></ul><ul><ul><li>From a given set of instances to be clustered, it creates k classes based on their similarity </li></ul></ul><ul><li>For spaces where the mean of two points cannot be computed, there is a variant of k-means: k-medoids. </li></ul><ul><ul><li>This actually solves the different stick problem </li></ul></ul><ul><ul><li>As usual, by solving a problem we introduce a whole range of others. </li></ul></ul><ul><li>Combining them </li></ul>
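
To illustrate why k-medoids handles the "different sticks", here is a minimal sketch of a distance over mixed features (strings, integers, booleans) and of picking a medoid, the cluster member with the smallest total distance to all others. The per-type distances are assumptions: `difflib.SequenceMatcher` stands in for whatever string distance the authors used, and the integer distance is an arbitrary relative difference.

```python
from difflib import SequenceMatcher


def feature_distance(a, b):
    """Distance in [0, 1] between two feature values of the same type."""
    if isinstance(a, bool) and isinstance(b, bool):   # check bool before int
        return 0.0 if a == b else 1.0
    if isinstance(a, str) and isinstance(b, str):
        # Stand-in string dissimilarity; an edit distance would also work.
        return 1.0 - SequenceMatcher(None, a, b).ratio()
    # Assumed numeric distance: difference relative to the larger magnitude.
    return abs(a - b) / max(abs(a), abs(b), 1)


def message_distance(x, y):
    """Average the per-feature distances of two feature tuples."""
    return sum(feature_distance(a, b) for a, b in zip(x, y)) / len(x)


def medoid(cluster):
    """Return the member with the smallest summed distance to the others."""
    return min(cluster, key=lambda c: sum(message_distance(c, o) for o in cluster))
```

No mean is ever computed, so a layout string never has to be averaged with anything; the cluster center is always one of the real messages.
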
  19. 19. An Example <ul><li>Is it a line or a square? </li></ul><ul><li>What about string features? </li></ul>
  20. 20. Our old model <ul><li>We focused mainly on correctly defining some powerful spam features </li></ul><ul><li>We totally neglected the clustering part </li></ul><ul><ul><li>So we used the good old-fashioned k-means and k-medoids. </li></ul></ul><ul><ul><li>And they have serious drawbacks </li></ul></ul><ul><ul><li>A fixed number of classes. </li></ul></ul><ul><ul><li>They work only with an offline corpus </li></ul></ul><ul><li>The results were... unpredictable. </li></ul><ul><li>Luck played a major role. </li></ul>
  21. 21. WOKM – Wave oriented K-Means <ul><li>Using plain k-means we could only cluster individual sets of emails </li></ul><ul><li>We now needed to cluster the whole incoming stream of spam </li></ul><ul><li>We also wanted to store a history of the clusters we extract </li></ul><ul><ul><li>And use that information to detect spam on the user side. </li></ul></ul><ul><ul><li>And also to help us better classify in the future </li></ul></ul><ul><ul><ul><li>Remember Churchill? </li></ul></ul></ul>
  22. 22. WOKM – How does it work? <ul><li>Takes snapshots of the incoming spam stream </li></ul><ul><li>Takes in only what is new </li></ul><ul><li>Trains on those messages </li></ul><ul><li>Stores the clusters for future reference </li></ul>
  23. 23. The spam corpus <ul><li>All the changes originate here </li></ul><ul><ul><li>All messages have an associated distance </li></ul></ul><ul><ul><li>The distance from them to the closest stored cluster in the cluster history </li></ul></ul><ul><li>New clusters must be closer than old ones </li></ul><ul><li>Constrained K-Means </li></ul><ul><ul><li>Wagstaff & Cardie, 2001 </li></ul></ul><ul><ul><li>“must fit” or “must not fit” </li></ul></ul><ul><ul><li>A history constraint </li></ul></ul>
  24. 24. The training phase <ul><li>While a solution has not been found: </li></ul><ul><ul><li>Unassign all the given examples </li></ul></ul><ul><ul><li>Assign all examples </li></ul></ul><ul><ul><ul><li>Create a given number of clusters </li></ul></ul></ul><ul><ul><ul><li>Assign what you can </li></ul></ul></ul><ul><ul><ul><li>Create some more and repeat the process </li></ul></ul></ul><ul><ul><li>Recompute centers </li></ul></ul><ul><ul><li>Merge adjacent (similar) clusters </li></ul></ul><ul><ul><ul><li>Counters the cluster inflation brought by the assign phase </li></ul></ul></ul><ul><ul><li>Test solution </li></ul></ul>
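
One way to read the history constraint is sketched below: every message carries its distance to the closest cluster in the stored history, only messages far enough from the history count as new, and a message may join a new cluster only if that cluster is closer than anything in the history. The function names and the `threshold` parameter are assumptions; the slides do not say how "new" is decided.

```python
def history_distance(message, history, dist):
    """Distance from a message to the closest stored cluster center."""
    return min((dist(message, c) for c in history), default=float("inf"))


def novel_messages(snapshot, history, dist, threshold):
    """Keep only messages that no stored cluster explains well enough.

    `threshold` is a hypothetical tuning parameter.
    """
    return [m for m in snapshot if history_distance(m, history, dist) > threshold]


def may_assign(message, center, history, dist):
    """History constraint: a new cluster must be closer than every old one."""
    return dist(message, center) < history_distance(message, history, dist)
```

With an empty history every message has infinite history distance, so the first snapshot is treated as entirely new.
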
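
The loop above can be sketched roughly as follows. This is not the authors' implementation: the `radius` that decides whether an example can be assigned or two centers merged, the `batch` size, the stopping test (centers unchanged between iterations) and the use of medoids as centers are all assumptions, and the history constraint is left out for brevity.

```python
import random


def assign(examples, centers, dist, radius):
    """Start from empty clusters; each example joins its nearest center if close enough."""
    clusters = [[] for _ in centers]
    leftovers = []
    for x in examples:
        if centers:
            j = min(range(len(centers)), key=lambda i: dist(x, centers[i]))
            if dist(x, centers[j]) <= radius:
                clusters[j].append(x)
                continue
        leftovers.append(x)
    return clusters, leftovers


def medoid(cluster, dist):
    """Member with the smallest summed distance to the rest of the cluster."""
    return min(cluster, key=lambda c: sum(dist(c, o) for o in cluster))


def merge_close(centers, dist, radius):
    """Drop any center that lies within `radius` of one already kept."""
    kept = []
    for c in centers:
        if all(dist(c, k) > radius for k in kept):
            kept.append(c)
    return kept


def train(examples, dist, radius, batch=2, max_iters=20, seed=0):
    rng = random.Random(seed)
    centers = []
    for _ in range(max_iters):
        previous = list(centers)
        # Assign phase: create centers in batches until nothing is left over.
        clusters, leftovers = assign(examples, centers, dist, radius)
        while leftovers:
            centers = centers + rng.sample(leftovers, min(batch, len(leftovers)))
            clusters, leftovers = assign(examples, centers, dist, radius)
        # Recompute centers, then merge to counter the inflation from assigning.
        centers = [medoid(c, dist) for c in clusters if c]
        centers = merge_close(centers, dist, radius)
        # Test solution: stop once the centers no longer change.
        if centers == previous:
            break
    return centers
```

On one-dimensional toy data such as `[1, 2, 3, 50, 51, 52, 100, 101]` with absolute difference as the distance and `radius=5`, the loop settles on one center per group without being told that there are three groups.
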
  25. 25. What’s worth remembering <ul><li>Accepts just about any kind of feature – Booleans, integers and strings. </li></ul><ul><li>K-means is limited because you have to know the number of classes a priori. </li></ul><ul><ul><li>WOKM determines the optimum number of classes automatically </li></ul></ul><ul><li>New messages will not be assigned to clusters that are not considered close enough </li></ul><ul><li>Has a fast novelty detection phase, so it can train itself only with new spam. </li></ul><ul><li>Can use the triangle inequality to speed things up. </li></ul><ul><li>(Future work) Allows us to keep track of the changes spammers make in the design of their products. </li></ul><ul><ul><li>By watching clusters that are close to each other </li></ul></ul>
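
The slides do not say how the triangle inequality is applied. A standard pruning trick, shown here as a sketch, is to skip a candidate center c' whenever d(c, c') ≥ 2·d(x, c) for the current best center c, because then d(x, c') ≥ d(c, c') − d(x, c) ≥ d(x, c). This only holds if `dist` is a true metric; `nearest_center` and the precomputed `center_dists` table are hypothetical names.

```python
def nearest_center(x, centers, dist, center_dists):
    """Find the closest center, skipping those the triangle inequality rules out.

    `center_dists[i][j]` holds the precomputed distance between centers i and j.
    Returns (index of nearest center, number of distance evaluations).
    """
    best = 0
    best_d = dist(x, centers[0])
    computed = 1
    for j in range(1, len(centers)):
        if center_dists[best][j] >= 2 * best_d:
            continue                    # center j cannot be closer; skip it
        d = dist(x, centers[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed
```

The saving matters most when the distance itself is expensive, as with string features, since distances between centers are computed once and reused for every message.
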
  26. 26. Results <ul><li>Perhaps the most exciting results – the cross language spam clusters </li></ul>
  27. 27. Results (2) <ul><li>Then in Spanish </li></ul><ul><li>We were surprised to find that this is not an isolated case. YouTube, Microsoft and Facebook fraud attempts were also found in multiple languages </li></ul>
  28. 28. Results (3) <ul><li>Then again in French (different though) </li></ul>
  29. 29. And finally the promised charts
  30. 30. And finally the promised charts (2)