Bayesian Inference using b8


Mike Creuzer's presentation from the June 2009 meeting of the Suburban Chicago PHP & Web Development Meetup

Published in: Technology, Education

  1. Bayesian Inferencing, AKA Naive Bayesian Filtering, Using b8
  2. Spam Filtering with b8:
       // Start using the Bayesian filtering class
       $b8 = new b8;

       // Try to classify a text
       echo $b8->classify('Hello World');

       // Show it something that isn't spam
       echo $b8->classify("Everybody has a birthday");
       $b8->learn("Everybody has a birthday", "ham");
       echo $b8->classify("Everybody has a birthday");

       // Show it something that is spam
       echo $b8->classify("Today isn't mine.");
       $b8->learn("Today isn't mine.", "spam");
       echo $b8->classify("Today isn't mine.");

       // Try to classify a text
       echo $b8->classify("It's somebody's birthday today");

       // Show it that this isn't spam too
       echo $b8->classify("It's somebody's birthday today");
       $b8->learn("It's somebody's birthday today", "ham");
       echo $b8->classify("It's somebody's birthday today");

       // Let's try this one on for size
       echo $b8->classify("Say Happy Birthday to Dave!");
       // That was pretty quick, wasn't it?

     Output:
       Spaminess: could not calculate spaminess
       Classification before learning: could not calculate spaminess
       Saved the text as Ham
       Classification after learning: could not calculate spaminess
       Classification before learning: could not calculate spaminess
       Saved the text as Spam
       Classification after learning: 0.884615
       Spaminess: 0.583509
       Classification before learning: 0.583509
       Saved the text as Ham
       Classification after learning: 0.105294
       Spaminess: 0.065217
  3. Any Questions?
  4. Good! I am glad I am not the only one... AKA Wikipedia to the rescue...
  5. What is Bayesian Inference?
     • In layman's terms:
     • A bunch of statistical mumbo-jumbo that learns from the past to allow you to classify in the future.
     • Or, more precisely, from Wikipedia:
     • Bayesian inference is statistical inference in which evidence or observations are used to update or to newly infer the probability that a hypothesis may be true. The name "Bayesian" comes from the frequent use of Bayes' theorem in the inference process. Bayes' theorem was derived from the work of the Reverend Thomas Bayes.
  6. Who?
     • Thomas Bayes (c. 1702 – 7 April 1761) was a British mathematician and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name: Bayes' theorem, which was published posthumously.
     • Bayes' solution to a problem of "inverse probability" was presented in the Essay Towards Solving a Problem in the Doctrine of Chances (1764), published posthumously by his friend Richard Price in the Philosophical Transactions of the Royal Society of London. This essay contains a statement of a special case of Bayes' theorem.
  7. Bayes' Theorem
     • Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has a non-vanishing probability:
         P(A|B) = P(B|A) P(A) / P(B)
     • Each term in Bayes' theorem has a conventional name:
         • P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
         • P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
         • P(B|A) is the conditional probability of B given A.
         • P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
     • Intuitively, Bayes' theorem in this form describes the way in which one's beliefs about observing 'A' are updated by having observed 'B'.
     • Objective Bayesians emphasise that these probabilities are fixed by a body of well-specified background knowledge (K), so their version of the theorem conditions each probability on K as well.
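If the formula looks like it came out of nowhere, it is really just the definition of conditional probability written two ways. A short derivation, using the same A and B as the slide:

```latex
\begin{align}
P(A \mid B)\,P(B) &= P(A \cap B) = P(B \mid A)\,P(A) \\
\Rightarrow\quad P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad P(B) > 0 .
\end{align}
```

The requirement that P(B) be non-vanishing is simply what allows the division in the last step.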
  8. Duh... an example please?
     • Suppose there is a co-ed school having 60% boys and 40% girls as students. The girl students wear trousers or skirts in equal numbers; the boys all wear trousers. An observer sees a (random) student from a distance; all the observer can see is that this student is wearing trousers. What is the probability this student is a girl? The correct answer can be computed using Bayes' theorem.
     • The event A is that the student observed is a girl, and the event B is that the student observed is wearing trousers. To compute P(A|B), we first need to know:
         • P(A), or the probability that the student is a girl regardless of any other information. Since the observer sees a random student, meaning that all students have the same probability of being observed, and the fraction of girls among the students is 40%, this probability equals 0.4.
         • P(A'), or the probability that the student is a boy regardless of any other information (A' is the complementary event to A). This is 60%, or 0.6.
         • P(B|A), or the probability of the student wearing trousers given that the student is a girl. As they are as likely to wear skirts as trousers, this is 0.5.
         • P(B|A'), or the probability of the student wearing trousers given that the student is a boy. This is given as 1.
         • P(B), or the probability of a (randomly selected) student wearing trousers regardless of any other information. Since P(B) = P(B|A)P(A) + P(B|A')P(A'), this is 0.5×0.4 + 1×0.6 = 0.8.
     • Given all this information, the probability of the observer having spotted a girl given that the observed student is wearing trousers can be computed by substituting these values in the formula:
         P(A|B) = P(B|A) P(A) / P(B) = (0.5 × 0.4) / 0.8 = 0.25
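For anyone who would rather check the arithmetic in code, here is a minimal sketch of the trousers example. It is written in Python rather than PHP purely for illustration, and the function and variable names are made up for this example.

```python
def posterior(prior_a, b_given_a, b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|A') using Bayes' theorem."""
    prior_not_a = 1.0 - prior_a
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|A')P(A')
    evidence = b_given_a * prior_a + b_given_not_a * prior_not_a
    return b_given_a * prior_a / evidence


# A = "the student is a girl", B = "the student is wearing trousers"
p_girl_given_trousers = posterior(prior_a=0.4, b_given_a=0.5, b_given_not_a=1.0)
print(round(p_girl_given_trousers, 4))
```

Changing the inputs is a quick way to build intuition: if the girls never wore trousers (`b_given_a=0.0`), the posterior would drop to zero no matter how many girls attend the school.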
  9. Yeah... I can't really get my head around it either...
     • There are lots of resources online to help.
         • Really cool explanation using Venn diagrams
  10. Which brings us back to b8
     • b8 is a naive Bayesian spam filter library written by Tobias Leupold.
     • Why use a class library?
     • - for the usual reasons -
         • Written by somebody who knows more about the problem
         • It just works with a minimum of fuss
         • Many of the gotchas and edge cases should be resolved
         • Published code reviewed by many people.
     • It does all that stuff from two slides ago as easily as you saw on the 2nd slide.
  11. b8's 'target audience'
     • b8 is designed (optimized, really) to classify very short messages, such as comments left on blogs.
     • b8 accepts only a single text string for classification. There is no header/body distinction.
     • b8 tallies the number of instances of a word, so it can distinguish between a single URL in a comment and 20 links.
     • The author says it may not be suited to longer text strings such as email messages.
  12. How does b8 work?
     • b8 'tokenizes' a string into individual words, URL bits & pieces, IP addresses, HTML tags, etc.
         • You can create your own 'lexer' if you want different tokens.
         • Tokens that aren't in the existing known token list go through a 'degenerator' process to try to find similar tokens.
     • b8 picks the 15 (configurable) most interesting tokens (those farthest from a score of 0.5) to calculate the probability with.
     • b8 can also 'learn' that a text's set of tokens represents spam or not. It will use this new data for future classifications.
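The "most interesting tokens" step is easier to see in code. The sketch below is not b8's implementation. It assumes a plain dictionary of (spam count, ham count) pairs, a simple count ratio as the per-token probability, and the textbook naive Bayes combination with equal priors; b8's actual formula may well differ. All names are hypothetical.

```python
from math import prod


def token_probability(spam_count, ham_count):
    """Estimate P(spam | token) from raw counts; None if the token is unknown."""
    total = spam_count + ham_count
    if total == 0:
        return None
    return spam_count / total


def spaminess(tokens, counts, max_tokens=15):
    """Combine the most interesting token probabilities into one score.

    `counts` maps token -> (spam_count, ham_count). Returns None when no
    token is known, much like "could not calculate spaminess" on slide 2.
    """
    probs = []
    for token in tokens:
        spam_count, ham_count = counts.get(token, (0, 0))
        p = token_probability(spam_count, ham_count)
        if p is not None:
            # Assumption: clamp so a single token can never force 0 or 1.
            probs.append(min(max(p, 0.01), 0.99))
    if not probs:
        return None
    # "Interesting" means far from the neutral score of 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    chosen = probs[:max_tokens]
    spam_product = prod(chosen)
    ham_product = prod(1.0 - p for p in chosen)
    return spam_product / (spam_product + ham_product)


counts = {"viagra": (5, 0), "birthday": (0, 3), "today": (1, 1)}
print(spaminess(["viagra", "today"], counts))
print(spaminess(["birthday", "today"], counts))
print(spaminess(["never", "seen"], counts))
```

Capping the number of tokens keeps a long run of neutral words from drowning out the few tokens that actually carry evidence.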
  13. How the default Lexer creates Tokens
     • The lexer is where you can really give b8 its 'smarts', as you can define how the individual tokens are created.
     • The default Lexer tries to find all IP addresses and URL-looking strings in the provided text. It then breaks the URLs into bits, using both the whole URL and its individual elements as tokens.
     • The default Lexer also tries to pull out the HTML tags to use as tokens as well.
     • Remember, it was originally written to combat blog comment spam, which is primarily links to websites.
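To make the lexer idea concrete, here is a rough sketch of a tokenizer in the same spirit. The regular expressions and the `tokenize` function are assumptions made for this example and are not taken from b8's source.

```python
import re

# Hypothetical patterns; b8's default lexer uses its own rules.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")


def tokenize(text):
    """Split text into URL, URL-part, IP, HTML-tag and plain-word tokens."""
    tokens = []
    for url in URL_RE.findall(text):
        tokens.append(url)  # the whole URL is one token...
        # ...and so is each of its pieces (scheme, host labels, path parts).
        tokens.extend(part for part in re.split(r"[/:.?=&]+", url) if part)
    tokens.extend(IP_RE.findall(text))
    tokens.extend(TAG_RE.findall(text))
    # Whatever is left after removing URLs and tags is split on whitespace.
    remainder = TAG_RE.sub(" ", URL_RE.sub(" ", text))
    tokens.extend(remainder.split())
    return tokens


print(tokenize('Visit <a href="http://example.com/buy">here</a> now'))
```

Keeping both the whole URL and its pieces means a brand-new spam URL can still match on a host name or path fragment that has been seen before.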
  14. What else can you have the Lexer do?
     • With a little insight into the text strings you're trying to classify, you can make the Lexer quite intelligent in creating tokens.
     • For email classification, you can create a token out of the SPF record lookup in the header. You could also create a token out of the spam score header line added by your email host's spam filter.
     • Some Bayesian implementations will tokenize on phrases, so sentence structure can be utilized instead of just a list of words.
     • Doing this allows the following two phrases, which both use the words 'buy' and 'now', to be distinguished:
     • "Now I know what to buy" vs. "24 Hour Sale! Buy Now"
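Phrase tokens are easy to try out. The sketch below builds two-word phrases (bigrams) from the two example sentences; `phrase_tokens` is a hypothetical helper for illustration, not part of b8.

```python
import re


def phrase_tokens(text, n=2):
    """Return overlapping n-word phrases from the text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


ham_phrases = phrase_tokens("Now I know what to buy")
spam_phrases = phrase_tokens("24 Hour Sale! Buy Now")

print(ham_phrases)
print(spam_phrases)
print("buy now" in spam_phrases, "buy now" in ham_phrases)
```

As single words, both sentences contain 'buy' and 'now'. As phrases, only the sales pitch contains "buy now", so that token can pick up a spam score of its own.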
  15. Degeneration
     • b8 will take a token that it hasn't seen before and do several transforms on it, trying to find it in the existing corpus of known tokens. If one or more degenerated versions are found, it picks the most interesting one for scoring.
     • b8 only does this when scoring text. It will not save degenerated tokens.
     • The degeneration process currently applies several different transforms:
         • lowercase the whole token
         • uppercase the whole token
         • capitalize the first letter of the token
         • remove punctuation such as . ! ? from the token
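Here is one way the degeneration lookup could be sketched. The transforms follow the list on the slide, but the details are assumptions: in particular, this version only strips trailing punctuation, and `known` is a made-up dictionary mapping each token to its spam probability.

```python
def degenerate(token):
    """Return variants of an unseen token, using the transforms from the slide."""
    def case_variants(t):
        # lowercase, uppercase, and first letter capitalized
        return [t.lower(), t.upper(), t[:1].upper() + t[1:]]

    candidates = case_variants(token)
    stripped = token.rstrip(".!?")  # assumption: trailing punctuation only
    if stripped and stripped != token:
        candidates.append(stripped)
        candidates.extend(case_variants(stripped))
    unique = []
    for candidate in candidates:
        if candidate != token and candidate not in unique:
            unique.append(candidate)
    return unique


def best_known_variant(token, known):
    """Pick the known variant whose probability is farthest from 0.5, if any."""
    if token in known:
        return token
    matches = [v for v in degenerate(token) if v in known]
    if not matches:
        return None
    return max(matches, key=lambda v: abs(known[v] - 0.5))


known = {"buy": 0.55, "BUY": 0.95, "free": 0.9}
print(degenerate("FREE!!!"))
print(best_known_variant("FREE!!!", known))
print(best_known_variant("Buy", known))
```

Nothing is written back to `known` here, which matches the point on the slide that degenerated tokens are only used for scoring and are never saved.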
  16. Learning about spam and not spam
     • b8 saves each token into a database with a count of the number of instances it was seen in both spam texts and non-spam texts.
     • b8 also saves when each token was last seen, but I don't know if this is really used or was just curiosity on the author's part.
     • When a token already exists in the database, b8 updates its spam/not-spam counts.
     • b8 counts each instance of a token in each learned text, not just a single instance of a token for a given text. It's possible for a token's count to exceed the total number of texts learned.
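The counting rule in the last bullet is worth seeing with real numbers. This sketch keeps the counts in memory instead of a database, and the `TokenStore` class is invented for this example rather than taken from b8.

```python
from collections import Counter


class TokenStore:
    """In-memory stand-in for the token table (hypothetical, not b8's schema)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.texts = {"spam": 0, "ham": 0}

    def learn(self, tokens, category):
        # Every occurrence counts, so repeated tokens are added repeatedly.
        self.counts[category].update(tokens)
        self.texts[category] += 1


store = TokenStore()
store.learn(["cheap", "pills", "http", "http", "http"], "spam")
store.learn(["happy", "birthday", "dave"], "ham")

print(store.counts["spam"]["http"], store.texts["spam"])
```

After a single spam text, the 'http' token already has a count of three, which is how a comment stuffed with links ends up weighing more than a comment with one.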
  17. forgetaboutit... AKA [ctrl]-z
     • b8 can also unlearn a text string.
     • This is useful if you accidentally flagged a text one way and the message really belonged the other way.
     • This is also useful because some implementations will auto-learn high-probability spam messages as spam. This can be done to make the system adapt to changes in spamming tactics as they are seen. New tactics seen at the same time as the old ones will automatically be learned as spam.
     • There is a potential problem: you can unlearn a text that was never learned in the first place, so beware.
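One possible way to guard against the "unlearn something you never learned" problem is to refuse any unlearn that would push a count below zero. This is a sketch of that idea only, not a description of what b8 does, and the function names are made up.

```python
from collections import Counter


def learn(counts, tokens):
    """Add every token occurrence to the counts for one category."""
    counts.update(tokens)


def unlearn(counts, tokens):
    """Subtract a text's tokens, refusing if it cannot have been learned."""
    removal = Counter(tokens)
    for token, n in removal.items():
        if counts[token] < n:
            raise ValueError(f"cannot unlearn {token!r}: count would go negative")
    counts.subtract(removal)
    # Drop tokens whose count fell to zero so the store stays tidy.
    for token in removal:
        if counts[token] == 0:
            del counts[token]


spam = Counter()
learn(spam, ["buy", "now", "buy"])
unlearn(spam, ["buy", "now", "buy"])
print(spam)
```

The check is only a partial safeguard: a text that was never learned but happens to share all of its tokens with learned texts would still pass. Catching that case means remembering which texts were learned, for example by storing a hash of each one.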
  18. The future of b8
     • The author and another individual currently have the next version, 0.5, in development in SVN. It's basically a total rewrite of everything but the core Bayesian math processing.
     • This new version is a complete PHP5-native rewrite.
     • MySQL query usage is much more efficient, providing a significant speed increase.
     • Work is being done on multiple categorization, not just spam/not spam. This looks to significantly complicate the code, so it isn't likely to be a 0.5 feature.
  19. Bayesian Poisoning
     • The idea is to provide enough otherwise innocuous text that the 'spam' message is lost amongst the non-spam text.
     • There are several ways this is done:
         • Random dictionary words.
         • Short text snippets from various sources, such as Shakespeare, Wikipedia or news websites.
         • The spam message is embedded into an image file, where the Bayesian inference engine can't see it.