Machine Learning
  in JavaScript
Jamison Dance
i.tv
@jergason
http://jamisondance.com
Smart?
Smart
Curious
Two years
•Secret sauce at SpotterRF

•StackOverflow.com
•Random side projects
What is
 it?
Math
“Making computers
 modify or adapt their
actions so these actions
 get more accurate.” -
   Stephen Marsland
Teaching computers to
  recognize patterns
Why do
you care?
An avalanche
  of data
“The purpose of
computing is insight, not
  numbers.” - Richard
      Hamming
Why
JavaScript?
Atwood’s
  Law
Naive Bayes
  and puppies
Spam Filtering?
Illustrious sir/madame,
I have recently acquired a bounteous
cache of 7 million semicolons. They can all
be yours if you send a money order for
30 semicolons.
Most graciously,
A spammer
An idea
• Count up all words in all spam
• Count up all words in not-spam
• Compare counts to words in new
 documents
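The counting idea above can be sketched in a few lines. This is a minimal illustration, not code from the talk: `tokenize`, `countWords`, `compareCounts`, and the training sentences are all hypothetical.

```javascript
// Hypothetical helper: split text into lowercase word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Count how often each word appears across a list of documents.
function countWords(documents) {
  const counts = new Map();
  for (const doc of documents) {
    for (const word of tokenize(doc)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}

// Made-up training data, for illustration only.
const spamCounts = countWords(['send a money order', 'money money now']);
const notSpamCounts = countWords(['lunch order for the team']);

// Look up both counts for every word in a new document.
function compareCounts(text) {
  return tokenize(text).map((word) => ({
    word,
    spam: spamCounts.get(word) || 0,
    notSpam: notSpamCounts.get(word) || 0,
  }));
}
```

Calling `compareCounts('Money order')` returns one entry per word with its spam and not-spam counts, which is the raw material the classifier needs.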
Class: Spam / Not Spam

         Word 1   Word 2   Word 3   Word 4
Spam       50       15       33        4
Not        45       27       14       55
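The per-word counts on these slides can be turned into P(word|class) by dividing each count by the class total. The sketch below assumes, purely for illustration, that these four words are the whole vocabulary; the `likelihoods` helper is hypothetical.

```javascript
// Counts from the slides: how often each word appeared in each class.
const counts = {
  spam: { word1: 50, word2: 15, word3: 33, word4: 4 },
  notSpam: { word1: 45, word2: 27, word3: 14, word4: 55 },
};

// Hypothetical helper: convert raw counts into P(word|class).
// Assumes the listed words are the entire vocabulary.
function likelihoods(wordCounts) {
  const total = Object.values(wordCounts).reduce((sum, n) => sum + n, 0);
  const result = {};
  for (const [word, n] of Object.entries(wordCounts)) {
    result[word] = n / total;
  }
  return result;
}

const pWordGivenSpam = likelihoods(counts.spam);
const pWordGivenNotSpam = likelihoods(counts.notSpam);
```

Under that assumption, Word 4 is far more likely in not-spam than in spam, while Word 3 leans the other way.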
Emergency!
 Puppies!
Bayes Theorem
P(A|B) = (P(B|A)*P(A)) / P(B)
In English
P(class|email) = (P(email|class)*P(class)) /
     P(word1 and word2 and word3)
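A small numeric sketch of the theorem, using made-up probabilities (none of these numbers come from the talk). The `posterior` function is a hypothetical helper, and P(email) is computed with the law of total probability over the two classes.

```javascript
// Bayes' theorem: P(A|B) = (P(B|A) * P(A)) / P(B).
function posterior(likelihood, prior, evidence) {
  return (likelihood * prior) / evidence;
}

// Made-up numbers, for illustration only.
const pEmailGivenSpam = 0.8;
const pEmailGivenNotSpam = 0.2;
const pSpam = 0.5;
const pNotSpam = 1 - pSpam;

// P(email) = P(email|spam)*P(spam) + P(email|not spam)*P(not spam)
const pEmail = pEmailGivenSpam * pSpam + pEmailGivenNotSpam * pNotSpam;

const pSpamGivenEmail = posterior(pEmailGivenSpam, pSpam, pEmail);
const pNotSpamGivenEmail = posterior(pEmailGivenNotSpam, pNotSpam, pEmail);
```

With these inputs the two posteriors sum to 1, and the spam posterior is the larger of the two.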
P(class|email) = (P(email|class)*P(class))/P(email)
• class: “Spam” or “Not Spam”
• email: words in the email
• P(class): what we think the probability of spam or not spam is
P(spam|email) = (P(email|spam)*P(spam))/P(email)
P(not spam|email) = (P(email|not spam)*P(not spam))/P(email)
Pick the largest one
P(spam|email) = (P(email|spam)*P(spam))/P(email)
P(not spam|email) = (P(email|not spam)*P(not spam))/P(email)
The P(email) denominators are the same, so drop them
P(spam|email) ∝ P(email|spam)*P(spam)
P(not spam|email) ∝ P(email|not spam)*P(not spam)
Assume P(spam) and P(not spam) are the same
P(spam|email) ∝ P(email|spam)
P(not spam|email) ∝ P(email|not spam)
Assume the words are independent (the "naive" part):
P(words|spam) = P(word1|spam) * P(word2|spam) * . . . * P(word_n|spam)
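Putting the simplified formula to work: the sketch below scores each class by multiplying P(word|class) over the words of a document and picks the largest. Two things here are additions, not from the slides: it sums logarithms instead of multiplying (to avoid underflow on long documents) and it uses add-one smoothing so unseen words do not zero out a class. As on the slides, the class priors are assumed equal and are left out. All function names and training texts are hypothetical.

```javascript
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Build per-class word counts from examples of the form { text, label }.
function train(examples) {
  const wordCounts = {}; // label -> Map(word -> count)
  const totals = {};     // label -> total number of words seen
  const vocabulary = new Set();
  for (const { text, label } of examples) {
    if (!wordCounts[label]) {
      wordCounts[label] = new Map();
      totals[label] = 0;
    }
    for (const word of tokenize(text)) {
      wordCounts[label].set(word, (wordCounts[label].get(word) || 0) + 1);
      totals[label] += 1;
      vocabulary.add(word);
    }
  }
  return { wordCounts, totals, vocabulary };
}

// Return the label with the largest sum of log P(word|label).
function classify(model, text) {
  let best = null;
  let bestScore = -Infinity;
  for (const label of Object.keys(model.wordCounts)) {
    let score = 0;
    for (const word of tokenize(text)) {
      const count = model.wordCounts[label].get(word) || 0;
      // Add-one smoothing (an addition to the slides' formula).
      score += Math.log((count + 1) / (model.totals[label] + model.vocabulary.size));
    }
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}

// Made-up training data, for illustration only.
const model = train([
  { text: 'send a money order for semicolons', label: 'spam' },
  { text: 'bounteous cache of semicolons can be yours', label: 'spam' },
  { text: 'meeting notes for the node project', label: 'not spam' },
  { text: 'lunch with the team on friday', label: 'not spam' },
]);
```

With this toy model, `classify(model, 'money order for semicolons')` favors the spam class because those words were only seen in spam examples.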
Emergency!
 Kittens!
In The Wild
Everyone Hates
 Hacker News
Mostly Crap
Yehuda Katz
hurt my feelings



                   rails sucks
                   node rules
                        lol
DRAMA
Some Good Stuff
Automatically Find
 The Good Stuff
Step 1
Gather Data
scraping with
    jsdom
storage with
 mongoose
An aside
90% data prep
  10% learning
naive bayes with
   credulous
Class: Like / Dislike
Features: Word 1 . . . Word n, username, hostname
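One way to feed the username and hostname to a word-based classifier is to treat them as extra tokens alongside the title words. This is a sketch under assumptions: the post shape (`title`, `url`, `username`), the `extractFeatures` name, and the `username:` / `hostname:` prefixes are all hypothetical and not taken from the talk's code.

```javascript
// Hypothetical feature extractor for a post of the form
// { title, url, username }.
function extractFeatures(post) {
  const words = post.title.toLowerCase().match(/[a-z0-9']+/g) || [];
  // The built-in URL class parses the hostname out of the link.
  const hostname = new URL(post.url).hostname;
  // Prefixes keep a username from colliding with an ordinary title word.
  return [...words, `username:${post.username}`, `hostname:${hostname}`];
}

// Placeholder post, for illustration only.
const features = extractFeatures({
  title: 'Node rules',
  url: 'http://example.com/post',
  username: 'example_user',
});
```

The resulting array can be counted per class exactly like the words in the spam example.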
programmer ui
  with node
recommend posts
   from HN api
In The Browser
