Do Humans Beat Computers At Pattern Recognition


Published in: Technology, News & Politics

  1. 1. Do humans beat computers at pattern recognition? Andra Miloiu Costina Spam Analyst
  2. 2. Do humans beat computers at pattern recognition? What do you think?
       - NO
       - YES
  3. 3. What is the correct answer?
  4. 4.  NO!
  5. 5.  NO!
  6. 6. NO! Each time we answered “NO”, one of the following automated signature mechanisms was designed:
       - Pattern extraction;
       - Lines detection;
       - Cluster-based rules generation;
       - Automated signature creation.
  7. 7. Why aren’t we all on a beach?
  8. 8. PATTERN EXTRACTION
       - Short description: The mechanism is conceptually divided into four steps: one that finds groups of similar emails (layout-based filtering), a second that extracts information for each group (a pattern discovery algorithm), a third that determines the utility of each extracted feature (a version of the Relief algorithm), and a final one that fits the pieces together to create the signatures (a genetic algorithm).
       - Pattern extraction mechanism: similar to Teiresias and a basic suffix tree.
       - Pros & cons:
         + It was among the first methods of automated pattern extraction that we designed.
         - It was difficult to use, and an analyst would have finished the signature much faster.
       - Stats: It increased our detection rate by 2%.
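The pattern discovery step can be illustrated with a much simpler stand-in than Teiresias or a suffix tree. The sketch below is an assumption for illustration only, not the presenters' implementation: it counts word n-grams and keeps those that appear in at least `min_support` messages of a group. The function name `frequent_ngrams` and its parameters are hypothetical.

```python
from collections import Counter


def frequent_ngrams(messages, n=3, min_support=2):
    """Return word n-grams that occur in at least min_support messages.

    Simplified stand-in for a pattern discovery algorithm; it is NOT
    Teiresias and does not use a suffix tree.
    """
    support = Counter()
    for text in messages:
        words = text.lower().split()
        # A set ensures each message counts a given n-gram at most once.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        support.update(grams)
    return {gram: count for gram, count in support.items() if count >= min_support}


# Hypothetical group of similar messages.
group = [
    "Buy replica watches today at low prices",
    "Hello friend buy replica watches now",
    "Cheap meds shipped overnight",
]
patterns = frequent_ngrams(group)
print(patterns)
```

In the pipeline described on the slide, patterns like these would still have to be scored (the Relief step) and combined into signatures (the genetic algorithm step); neither is sketched here.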
  9. 9. What did we do next? … LINES DETECTION
  10. 10. LINES DETECTION (1)
       - What did spam look like at that time?
       - Almost a year and a half ago, spam waves took a new turn: a spam message shrank to one or two spammy lines and one URL.
  11. 11. LINES DETECTION (2) Waves of this type arrived in such large numbers that they affected our response time, so we decided to implement a system that would sign these spam messages in a shorter period of time.
  12. 12. LINES DETECTION
       - Short description: The mechanism worked in three steps:
         1. The pattern, represented by a relevant text line, was extracted;
         2. Each line was associated with its number of occurrences, and the lines were sorted in descending order;
         3. Automated signatures were created for the top relevant lines.
       - Pattern extraction mechanism: Based on a predefined set of keywords, the program extracted the lines containing relevant information.
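The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword set, the function names `relevant_lines` and `top_lines`, and the sample messages are all hypothetical, and a "signature" is reduced here to the normalized line itself.

```python
from collections import Counter

# Hypothetical keyword set; the real predefined list is not given in the slides.
KEYWORDS = ("replica", "watches", "viagra")


def relevant_lines(message, keywords=KEYWORDS):
    """Step 1: keep the lines that contain at least one keyword."""
    lines = []
    for line in message.splitlines():
        normalized = " ".join(line.lower().split())  # collapse case and spacing
        if normalized and any(k in normalized for k in keywords):
            lines.append(normalized)
    return lines


def top_lines(messages, top_n=5):
    """Steps 2 and 3: count each line once per message, sort descending,
    and return the top lines as signature candidates."""
    counts = Counter()
    for message in messages:
        counts.update(set(relevant_lines(message)))
    return counts.most_common(top_n)


spam = [
    "Hi\nBuy Replica Watches\nhttp://example.com/a",
    "buy   replica watches\nhttp://example.com/b",
    "Best replica deals here\nhttp://example.com/c",
]
ranked = top_lines(spam)
print(ranked)
```

Counting each line once per message keeps a single message that repeats a phrase many times from dominating the ranking.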
  13. 13. LINES DETECTION For instance: [example image not preserved in the transcript]
  14. 14. LINES DETECTION
       - While in use, this system improved our response time by 6.4% and helped us sign a series of spam waves that would otherwise have taken an analyst much longer to handle.
       - The cause of death (C.O.D.) was mainly the decline in spam waves bearing the same relevant phrases in more than 40% of the cases.
       - The many different statements used to express the same point, “Buy Replica Watches”, made us change our perspective on how to create lasting signatures.
  16. 16. CLUSTER BASE RULES GENERATION
       - Short description:
         1. Mails are clustered;
         2. The clusters are reviewed by an analyst;
         3. The analyst adds a simple content-related pattern and creates the signature.
       - Pattern extraction mechanism: In contrast to the previously described system, which was based entirely on the content of a spam message, the cluster-based rules rely on patterns belonging to the email’s template, such as the body summary, the date format, the number of URLs, or the number of separators found in the subject.
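A rough sketch of template-based clustering follows, assuming that emails sharing the same feature tuple fall into the same cluster. The feature definitions, the separator set, and the names `template_features` and `cluster` are assumptions made for illustration; the body summary feature is left out because the slides do not define it.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")
SEPARATORS = set("-_|:;,.!")  # assumed separator characters


def template_features(subject, body, date_header):
    """Compute template features that ignore the actual wording."""
    # Map digits to 9 and letters to a, so only the shape of the date remains.
    date_format = re.sub(r"[A-Za-z]", "a", re.sub(r"\d", "9", date_header))
    url_count = len(URL_RE.findall(body))
    separator_count = sum(ch in SEPARATORS for ch in subject)
    return (date_format, url_count, separator_count)


def cluster(emails):
    """Group emails whose template features are identical."""
    clusters = defaultdict(list)
    for email in emails:
        key = template_features(email["subject"], email["body"], email["date"])
        clusters[key].append(email)
    return clusters


emails = [
    {"subject": "Hot deal!!", "body": "Click http://example.com/x",
     "date": "Mon, 01 Jan 2010 10:00:00"},
    {"subject": "New offer!!", "body": "Visit http://example.com/y",
     "date": "Tue, 02 Feb 2010 11:30:15"},
    {"subject": "Meeting notes", "body": "See you at noon.",
     "date": "2010-01-03 09:00"},
]
clusters = cluster(emails)
print({key: len(members) for key, members in clusters.items()})
```

Exact-match grouping is the simplest possible choice; a real system would more likely use a distance measure over the features so that near-identical templates still land together.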
  17. 17. CLUSTER BASE RULES GENERATION - Pros & Cons: The great advantage of this system is its universal applicability. The predefined set of features can be calculated for every email, so there are no messages that can’t be clustered. However, features based on the email’s template alone are not enough to mark an email as spam, as more and more of these messages copy the template used by regular/legitimate emails. Hence we are working on new features that will allow the cluster-based rules to tag emails as spam without the intervention of an analyst.
  18. 18. AUTOMATED SIGNATURES CREATION
       - Short description: Until a few months ago, we believed that an automated pattern extraction mechanism wouldn’t be very efficient, given the variety currently found in spam belonging to the same wave.
       - Simplified, the process has five steps:
         1. Extracts patterns from a pool of spam;
         2. Sorts them by the number of occurrences;
         3. Creates automated signatures;
         4. Tests the newly created signatures;
         5. Sends them for an FP (false positive) test.
  19. 19. AUTOMATED SIGNATURES CREATION
       - Pattern extraction mechanism: Whereas the line extraction mechanism relied on a set of keywords to define the relevant phrases, this system extracts almost all the lines from a spam message (body and headers). Afterwards it eliminates the patterns that contain only HTML tags and the lines shorter than a predefined threshold.
       - Pros & Cons:
         + Helps decrease the reaction time;
         + Doesn’t create FPs;
         - Still needs an analyst to validate the resulting signatures.
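The extraction and filtering described above might look roughly like the sketch below. It is an illustration under assumptions, not the production system: the length threshold, the names `candidate_lines` and `build_signatures`, and the FP test (a plain substring check against a pool of legitimate mail) are all hypothetical.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")


def candidate_lines(message, min_length=12):
    """Yield lines from headers and body, skipping tag-only and short lines."""
    for line in message.splitlines():
        line = line.strip()
        if not TAG_RE.sub("", line).strip():
            continue  # nothing left once HTML tags are removed
        if len(line) < min_length:
            continue  # shorter than the (assumed) threshold
        yield line


def build_signatures(spam_pool, ham_pool, top_n=10):
    """Count candidate lines, keep the most frequent, drop any that hit ham."""
    counts = Counter()
    for message in spam_pool:
        # dict.fromkeys removes duplicates within a message, preserving order.
        counts.update(list(dict.fromkeys(candidate_lines(message))))
    ham_text = "\n".join(ham_pool)
    # Assumed FP test: reject a line if it appears anywhere in legitimate mail.
    return [line for line, _ in counts.most_common(top_n) if line not in ham_text]


spam_pool = [
    "Subject: Best watches\n<html><body>\nBuy Replica Watches now\n</body></html>",
    "Subject: Hot offer\n<p>\nBuy Replica Watches now\nThanks for your order",
]
ham_pool = ["Hi team,\nThanks for your order\nRegards"]
signatures = build_signatures(spam_pool, ham_pool)
print(signatures)
```

The resulting list is only a set of candidates; as the slide notes, an analyst still validates the signatures before they are used.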
  20. 20. Overview: All these systems are a step toward a fully automated mechanism for creating signatures. Their most important advantages are a better reaction time and an increase in the detection rate of 5%-10%. There are no FPs, because all the systems in use are overseen by analysts, who make the final decision on whether a signature is good or not.
  21. 21.  NO! What methods of automated pattern recognition have you developed?
  22. 22. Do humans beat computers at pattern recognition? What do you think?
       - NO
       - YES
  23. 23. If ( YES) { ANALYSTS RULE }
  24. 24. ANALYSTS TEAM Short description: We are a team of 10 people, full of enthusiasm and eager to put an end to spam. What makes us great? Our enhanced sense for recognizing patterns.
  25. 25. ANALYSTS TEAM - Pros & Cons
       + We can find a pattern in any given spam;
       + We know when it is safe to say “This is spam”;
       + We adapt to any situation;
       + We can predict certain evolutions of spam waves and be proactive about them;
       + We can maintain a detection rate of over 97%;
       - We are expensive;
       - We have a longer reaction time;
       - We sometimes make mistakes… we’re only human after all.
  26. 26. A few conclusions
       - Automated pattern extraction mechanisms:
         - Shorter reaction time;
         - Work only for some spam waves;
         - Are less expensive.
       - Analysts team:
         - Longer reaction time;
         - Can extract a pattern for any spam wave;
         - Costs a lot.
  27. 27. Q&A Andra Miloiu [email_address]