Natural language processing with naive bayes

A little talk I gave about NLP with Naïve Bayes for classification. I used the ideas to build http://twedar.herokuapp.com, and a client-side classifier for Skimlinks.



  1. Natural Language Processing with Naïve Bayes Tim Ruffles @timruffles
  2. Overview ● Intro to Natural Language Processing ● Intro to Bayes ● Bayesian Maths ● Bayes applied to Natural Language Processing
  3. NLP (not like Derren Brown)
  4. Processing text ● Named entity recognition - Skimwords ● Information retrieval - Google ● Information extraction - IBM's Watson ● Interpreting - sentiment, named entities ● Classification - spam vs not spam ● Speech to text - Siri
  5. Named entity recognition
  6. Classification ● From: Prime Minister of Nigeria / Subject: Opportunity: "Dear Sir, My country vexes me; I wish to leave. Please give me your bank account information for instantaneous enrichment, no danger to you! Yours in good faith and honour, Mr P. Minister" -> spam: 99%, ham: 1% ● From: sally@gmail.com / Subject: cats: "lol this cat is really fat http://reddit.com/r/fat-cats/roflolcoptor-fat-cat-dancing.gif" -> spam: 1%, ham: 99%
  7. Example Task
  8. Identify Product References
  9. How do humans do this? ● Algorithms are far dumber than you ● If you don't have enough info, an algorithm will not help ● Anyone can identify features required for natural language processing
  10. Features The new cameras are the Canon PowerShot S100, the Nikon J1 and the Olympus PEN.
  11. Types of features ● Word shape (capitalization, numbers etc) ● Tag context - near a product ● Dictionary/gazette - list of brands ● Part of speech - verb, noun ● n-Grams - products contain only one brand
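The word-shape features in the list above are cheap to compute. A minimal Python sketch; the feature names are my own, not from the talk:

```python
def word_shape_features(token):
    """Simple word-shape features of the kind listed above.
    Feature names are illustrative, not from the talk."""
    return {
        "initial_capital": token[:1].isupper(),
        "capital_in_middle": any(c.isupper() for c in token[1:]),
        "contains_digit": any(c.isdigit() for c in token),
        "all_caps": token.isupper() and len(token) > 1,  # acronym-ish
    }

# Product tokens from the example sentence light up several features
print(word_shape_features("PowerShot"))  # capital in the middle of the word
print(word_shape_features("S100"))       # initial capital plus digits
```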
  12. NLP process
  13. The new cameras are Canon's PowerShot S100, the Nikon J1 and the Olympus PEN.
  14. Supervision: The new cameras are Canon's PowerShot S100, the Nikon J1 and the Olympus PEN.
  15. Feature extraction: The new cameras are [Canon's PowerShot S100], the [Nikon J1] and the [Olympus PEN].
  16. Correlate features & tags: The new cameras are [Canon's PowerShot S100], the [Nikon J1] and the [Olympus PEN]. ● Rest of sentence: capital in middle of sentence: 0, capital in middle of word: 0, acronyms: 0, words with numbers in them: 0 ● Product names: capital in middle of sentence: 7, capital in middle of word: 1, acronyms: 1, words with numbers in them: 2
  17. NLP Overview ● Supervision with tagged data ● Training up a model ● Test model on test set ● Model ready to use
  18. Nuts and bolts ● Supervision - create a test set of labelled data ● Normalisation and clean-up (Canon's -> Canon etc) ● Feature extraction and training on training set ● Validate on test set
  19. How to use features/tags to tag products? ● We need a method for learning from our correlated feature/tag sets and predicting mathematically ● One such method is...
  20. Naïve Bayes
  21. "When my information changes, I alter my conclusions. What do you do, sir?" - Keynes
  22. Mathematically updating our beliefs on evidence
  23. Bayes: local hero
  24. Thomas Bayes
  25. An Essay towards solving a Problem in the Doctrine of Chances, 1763
  26. Example applications ● Given a drug test result, how likely is it a person has taken drugs? ● Given these words, how likely is it that this email is spam? ● Given these words, how likely is it they refer to a product?
  27. Estimate ● 99% accurate drug test ● 1% of people actually take drugs ● Given the above, what is the probability that someone indicated as drug positive by the test is a drug user?
  28. Place your bets
  29. 50%
  30. The Maths
  31. A Little Notation ● Probability runs from 0 to 1 ● 0: impossible - you'd never bet on it happening ● 0.5: likely as not - evens, the best odds you'd get would be 1/2 ● 1: certain - you'd never bet against it
  32. More notation ● P(spam) - probability of spam ● P(^spam) - probability of not spam ● P(spam|features) - probability of spam given some features
  33. A few rules ● P(6,6) = P(6)P(6) = 1/6 x 1/6 = 1/36 - probability of rolling a 6 twice ● P(^6) = 1 - P(6) = 1 - 1/6 = 5/6 - probability of not rolling a 6 is the inverse of rolling a 6
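Both rules can be checked mechanically. A quick sketch using Python's exact fractions:

```python
from fractions import Fraction

p_six = Fraction(1, 6)

# Independent events multiply: P(6,6) = P(6)P(6)
p_double_six = p_six * p_six
print(p_double_six)  # 1/36

# Complement: P(^6) = 1 - P(6)
p_not_six = 1 - p_six
print(p_not_six)  # 5/6
```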
  34. Independence ● P(A,B) = P(A)P(B) only applies if the two events are independent ● Events are independent if one having happened has no bearing on how likely the other is to happen
  35. Dependence is informative ● e.g. if someone is paler than normal, they could be sick: P(sick|pale) ≠ P(sick) ● if someone fails a drug test, they could be a drug user
  36. P(A|B)? What is the probability of A, given that B has happened?
  37. Drugs test ● 99% accurate, 1% of people take drugs ● Prior probability that someone is a drug user: 1% ● 1% chance of a false positive (the probability of something not happening is the inverse of it happening)
  38. Priors (pre-information) ● Prior: drug use - P(drug use) = 0.01 = 1/100 = 1% ● Prior: false positive - P(false positive) = 0.01 = 1/100 = 1%
  39. A drug test is asking P(drug user | positive drug test)
  40. An event given a signal: P(drug user | positive drug test) is P(event | signal)
  41. We can see a signal in at least 2 ways ● A positive in 2 ways: P(drug user, positive drug test) or P(non user, positive drug test) ● A negative in 2 ways: P(drug user, negative drug test) or P(non user, negative drug test)
  42. The theorem ● The chance of an event given a signal is the ratio of: the prior probability of the event, multiplied by that of seeing the signal given the event, to all the ways you could see that signal
  43. The calculation P(drug user | positive drug test) = P(drug user) x P(positive drug test | drug user) / P(positive drug test)
  44. Estimate ● 99% accurate drug test ● 1% of people take drugs ● Given the above, what is the chance someone failing the drugs test is a drug user?
  45. Place your bets
  46. 50%
  47. The calculation P(drug user | positive drug test) = (1/100 x 99/100) / P(positive drug test)
  48. P(B)?
  49. P(B) ● All the ways you could see the signal: ∑ P(event) x P(signal | event) (∑ is "sum of", i.e. add all the things)
  50. P(B) ● In our case there are two possibilities - the person is either a drug user or not - and we already know the result of the test ● P(user) x P(positive | user) + P(clean) x P(positive | clean), i.e. P(A) x P(B|A) + P(^A) x P(B|^A)
  51. The calculation P(drug user | positive drug test) = (1/100 x 99/100) / ((1/100 x 99/100) + (99/100 x 1/100))
  52. The calculation P(drug user | positive drug test) = (1 x 99) / ((1 x 99) + (99 x 1))
  53. The calculation P(drug user | positive drug test) = 99 / 99(1 + 1)
  54. The calculation P(drug user | positive drug test) = 1/2
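The whole drug-test calculation fits in a few lines. A sketch with exact fractions, confirming the 1/2 result:

```python
from fractions import Fraction

p_user = Fraction(1, 100)             # prior: 1% of people take drugs
p_pos_given_user = Fraction(99, 100)  # 99% accurate test
p_pos_given_clean = Fraction(1, 100)  # 1% false positive rate

# P(positive) = all the ways you could see the signal
p_pos = p_user * p_pos_given_user + (1 - p_user) * p_pos_given_clean

# Bayes: P(user | positive) = P(user) x P(positive | user) / P(positive)
p_user_given_pos = p_user * p_pos_given_user / p_pos
print(p_user_given_pos)  # 1/2
```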
  55. Maths applied to NLP
  56. Building a spam filter ● Using what we know about Bayes, we're going to build an NLP spam filter ● We'll use n-grams as our features - the number of times we have seen each word ● A 1-gram is each word, 2-grams are pairs of words: 2-grams are more accurate but more complex
  57. 1-grams ● Spam: "Dear Sir, Give me your bank account. I will transfer money from my bank account to your bank account. Yours in good faith and honour, Mr P. Minister" - bank 3, account 3, your 2, from 1, give 1, dear 1, sir 1, i 1, will 1, transfer 1, money 1, me 1, my 1, to 1 - 30 words total ● Ham: "Hi, Lovely to see you last night. I'll pay you back for the film - just give me your bank account details. Cheers, Sally x" - you 2, the 1, to 1, see 1, hi 1, last 1, night 1, ill 1, pay 1, back 1, for 1, me 1, your 1, bank 1, account 1, details 1 - 28 words total
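Counting 1-grams is a one-liner with a Counter. A sketch over the spam email above; the tokenisation details are my own:

```python
import re
from collections import Counter

def unigrams(text):
    """Lowercase, drop punctuation and apostrophes, count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower().replace("'", "")))

spam = ("Dear Sir, Give me your bank account. I will transfer money "
        "from my bank account to your bank account. "
        "Yours in good faith and honour, Mr P. Minister")

counts = unigrams(spam)
print(counts.most_common(3))  # bank and account dominate
```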
  58. 1-grams ● For spam, bank is 3 of 30 words, so 1/10 of words is bank: P(bank,bank,bank|spam) = P(bank) x P(bank) x P(bank) = (1/10)^3 = 1/1000 ● For ham, bank is 1 of 28 words: P(bank,bank,bank|ham) = (1/28)^3 = 1/21,952
  59. Smoothing 1-grams ● Laplacian smoothing - take a bit of probability away from each of our words to give to words we've not seen before ● Count each word as (count(word) + smooth) / (countWords + smooth x uniqueWords) ● e.g. 24 unique words
  60. Smoothing 1-grams ● P(word) = (count(word) + smooth) / (countWords + smooth x uniqueWords) ● P(bank) = (1 + 0.1) / (28 + 0.1 x 24) = 1.1 / 30.4 ● P(sesquipedalian) = 0.1 / 30.4
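The smoothing formula as a function, using the ham-email figures from the slide (bank seen once, 28 words, 24 unique):

```python
def smoothed_p(count, total_words, unique_words, smooth=0.1):
    """Laplacian-smoothed unigram probability, as on the slide."""
    return (count + smooth) / (total_words + smooth * unique_words)

p_bank = smoothed_p(1, 28, 24)    # a seen word: 1.1 / 30.4
p_unseen = smoothed_p(0, 28, 24)  # an unseen word gets a sliver: 0.1 / 30.4
print(p_bank, p_unseen)
```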
  61. Applied smoothing ● P(lovely,film|spam): P(lovely) = (0 + 0.1) / (30 + 24 x 0.1) = 0.1 / 32.4, so P(lovely) x P(film) = (0.1 / 32.4)^2 ● P(lovely,film|ham): P(lovely) x P(film) = (1.1 / 30.4)^2 ● (0.1 / 32.4)^2 < (1.1 / 30.4)^2
  62. Smoothed n-grams with Bayes ● P(A|B) = P(A)P(B|A) / P(B) ● P(spam|words) = P(spam)P(words|spam) / P(words) ● P(words) = P(spam)P(words|spam) + P(ham)P(words|ham) ● We'll take the product of all of the word probabilities as P(words|tag) for both spam and ham, and choose whichever tag has the highest P(tag|words)
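Putting priors, smoothing and the product of word probabilities together. A minimal sketch of the classifier described above, not the talk's own code: log-probabilities avoid underflow on long emails, and the training texts are abbreviated versions of the slide examples:

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower().replace("'", ""))

class NaiveBayes:
    def __init__(self, smooth=0.1):
        self.smooth = smooth
        self.word_counts = {}        # tag -> Counter of word frequencies
        self.doc_counts = Counter()  # tag -> number of training emails

    def train(self, text, tag):
        self.word_counts.setdefault(tag, Counter()).update(tokens(text))
        self.doc_counts[tag] += 1

    def classify(self, text):
        vocab = len(set().union(*self.word_counts.values()))
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for tag, counts in self.word_counts.items():
            # log P(tag) + sum of log P(word|tag), Laplacian-smoothed
            score = math.log(self.doc_counts[tag] / total_docs)
            denom = sum(counts.values()) + self.smooth * vocab
            for word in tokens(text):
                score += math.log((counts[word] + self.smooth) / denom)
            scores[tag] = score
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train("give me your bank account for instantaneous enrichment", "spam")
nb.train("lovely to see you last night, pay you back for the film", "ham")
print(nb.classify("your bank account"))  # spam
print(nb.classify("lovely film"))        # ham
```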
  63. Applied smoothing ● Priors: P(spam) = 0.5, P(ham) = 0.5; compare P(spam)P(words|spam) / P(words) with P(ham)P(words|ham) / P(words) ● Email one (spam): P(bank,bank,bank|spam) = 8.75e-04 -> 0.39 = 39%; P(bank,bank,bank|ham) = 4.73e-05 -> 0.02 = 2% ● Email two (ham): P(lovely,film|spam) = 9.52e-06 -> 0.004 = 0.4%; P(lovely,film|ham) = 1.3e-03 -> 0.58 = 58%
  64. Summary ● NLP uses features of language to statistically classify, interpret or generate language ● Bayes' rule is a mathematical method for updating your beliefs on evidence ● P(event|signal) = P(event)P(signal|event) / P(signal) ● Smoothed n-grams make a dumb but simple spam filter ● Naïve Bayes shouldn't work, but it does
