
- 1. Natural Language Processing with Naïve Bayes Tim Ruffles @timruffles
- 2. Overview ● Intro to Natural Language Processing ● Intro to Bayes ● Bayesian Maths ● Bayes applied to Natural Language Processing
- 3. NLP (not like Derren Brown)
- 4. Processing text ● Named entity recognition - Skimwords ● Information retrieval - Google ● Information extraction - IBM's Watson ● Interpreting - sentiment, named entities ● Classification - spam vs not spam ● Speech to text - Siri
- 5. Named entity recognition
- 6. Classification. Email one: From: Prime Minister of Nigeria / Subject: Opportunity. "Dear Sir, My country vexes me; I wish to leave. Please give me your bank account information for instantaneous enrichment, no danger to you! Yours in good faith and honour, Mr P. Minister" -> spam: 99%, ham: 1%. Email two: From: sally@gmail.com / Subject: cats. "lol this cat is really fat http://reddit.com/r/fat-cats/roflolcoptor-fat-cat-dancing.gif" -> spam: 1%, ham: 99%.
- 7. Example Task
- 8. Identify Product References
- 9. How do humans do this? ● Algorithms are far dumber than you ● If you don't have enough info, an algorithm will not help ● Anyone can identify features required for natural language processing
- 10. Features The new cameras are the Canon PowerShot S100, the Nikon J1 and the Olympus PEN.
- 11. Types of features ● Word shape (capitalisation, numbers etc.) ● Tag context - near a product ● Dictionary/gazetteer - list of brands ● Part of speech - verb, noun ● n-Grams - products contain only one brand
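The word-shape features listed above can be sketched as simple predicates. A minimal illustration in Python; the function and feature names here are my own, not from the slides:

```python
import re

def word_shape_features(token):
    """Extract simple word-shape features for one token
    (illustrative feature names, not from the talk)."""
    return {
        "initial_capital": token[:1].isupper(),
        "capital_in_middle_of_word": any(c.isupper() for c in token[1:]),
        "contains_digit": any(c.isdigit() for c in token),
        "is_acronym": bool(re.fullmatch(r"[A-Z]{2,}", token)),
    }

# a product token like "S100" is capitalised and contains digits,
# which is exactly the kind of signal the tagger can exploit
features = word_shape_features("S100")
```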
- 12. NLP process
- 13. The new cameras are Canon's PowerShot S100, the Nikon J1 and the Olympus PEN.
- 14. The new cameras are Canon's PowerShot S100, the Nikon J1 and the Olympus PEN. Supervision
- 15. Feature extraction: The new cameras are [Canon's PowerShot S100], the [Nikon J1] and the [Olympus PEN].
- 16. Correlate features & tags: The new cameras are [Canon's PowerShot S100], the [Nikon J1] and the [Olympus PEN]. Non-product text: capital in middle of sentence: 0; capital in middle of word: 0; acronyms: 0; words with numbers in them: 0. Product text: capital in middle of sentence: 7; capital in middle of word: 1; acronyms: 1; words with numbers in them: 2.
- 17. NLP Overview Supervision with tagged data Training up a model Test model on test set Model ready to use
- 18. Nuts and bolts Supervision - create a test set of labelled data Normalisation and clean-up (Canon's -> Canon etc) Feature extraction and training on training set Validate on test set
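The normalisation and clean-up step ("Canon's -> Canon") might look like this. A rough sketch, assuming we only need to handle possessives and trailing punctuation; real normalisation would do much more:

```python
def normalise(token):
    """Clean-up from the pipeline above: lower-case, strip surrounding
    punctuation, and drop possessives ("Canon's" -> "canon").
    A minimal sketch, not a full normaliser."""
    token = token.lower().strip(".,!?;:")
    if token.endswith("'s"):
        token = token[:-2]
    return token
```

Stripping punctuation before checking for the possessive matters: otherwise a token like "Canon's," never matches the `'s` suffix.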
- 19. How to use features/tags to tag products? ● We need a method for learning from our correlated feature/tag sets and making predictions mathematically ● One such method is...
- 20. Naïve Bayes
- 21. When my information changes, I alter my conclusions. What do you do, sir? Keynes
- 22. Mathematically updating our beliefs on evidence
- 23. Bayes: local hero
- 24. Thomas Bayes
- 25. An Essay towards solving a Problem in the Doctrine of Chances 1763
- 26. Example applications ● Given a drug test result, how likely is it a person has taken drugs? ● Given these words, how likely is it that this email is spam? ● Given these words, how likely is it they refer to a product?
- 27. Estimate ● 99% accurate drug test ● 1% of people actually take drugs Given the above, what is the probability that someone indicated as drug positive by the test is a drug user?
- 28. Place your bets
- 29. 50%
- 30. The Maths
- 31. A Little Notation. Probability runs from 0 to 1: 0 = impossible (you'd never bet on it happening); 0.5 = likely as not (evens; the best odds you'd get would be 1/2); 1 = certain (you'd never bet against it).
- 32. More notation P(spam) Probability of spam P(^spam) Probability of not spam P(spam|features) Probability of spam given some features
- 33. A few rules. P(6,6) = P(6) × P(6) = 1/6 × 1/6 = 1/36 (probability of rolling a six twice). P(^6) = 1 - P(6) = 1 - 1/6 = 5/6 (the probability of not rolling a six is the complement of rolling one).
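Both rules are easy to check with exact arithmetic using Python's fractions module:

```python
from fractions import Fraction

p_six = Fraction(1, 6)

# product rule for independent events: P(6,6) = P(6) * P(6) = 1/36
p_double_six = p_six * p_six

# complement rule: P(^6) = 1 - P(6) = 5/6
p_not_six = 1 - p_six
```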
- 34. Independence. P(A,B) = P(A)P(B). Only applies if the two events are independent. Events are independent if one having happened has no bearing on how likely the other is.
- 35. Dependence is informative e.g: if someone is paler than normal, they could be sick P(sick|pale) ≠ P(sick) if someone fails a drug test, they could be a drug user
- 36. P(A|B)? What is the probability of A, given that B has happened?
- 37. Drugs test ● 99% accurate, 1% of people take drugs ● Prior probability that someone is a drug user: 1% ● 1% chance of a false positive (the probability of something not happening is the complement of it happening).
- 38. Priors (pre information) Prior: drug use P(drug use) = 0.01 = 1/100 = 1% Prior: false positive P(false positive) = 0.01 = 1/100 = 1%
- 39. A drug test is asking P(drug user | positive drug test)
- 40. Conditional probability of an event given a signal: P(drug user | positive drug test), i.e. P(event | signal)
- 41. We can see a signal in at least two ways. A positive test arises in two ways: P(drug user, positive drug test) or P(non user, positive drug test). A negative likewise: P(drug user, negative drug test) or P(non user, negative drug test).
- 42. The theorem. The chance of an event given a signal is the ratio of: the prior probability of the event, multiplied by the probability of seeing the signal given the event, to the total probability of all the ways you could see that signal.
- 43. The calculation: P(drug user | positive drug test) = P(drug user) × P(positive drug test | drug user) / P(positive drug test)
- 44. Estimate ● 99% accurate drug test ● 1% of people take drugs Given the above, what is the chance someone failing the drugs test is a drug user?
- 45. Place your bets
- 46. 50%
- 47. The calculation: P(drug user | positive drug test) = (1/100 × 99/100) / P(positive drug test)
- 48. P(B)?
- 49. P(B) ● All the ways you could see the signal: ∑ P(event) × P(signal | event) (∑ means sum: add all the terms)
- 50. P(B) ● In our case there are two possibilities - the person is either a drug user or not - and we already know the result of the test: P(user) × P(positive | user) + P(clean) × P(positive | clean), i.e. P(A) × P(B|A) + P(^A) × P(B|^A)
- 51. The calculation: P(drug user | positive drug test) = (1/100 × 99/100) / ((1/100 × 99/100) + (99/100 × 1/100))
- 52. The calculation: P(drug user | positive drug test) = (1 × 99) / ((1 × 99) + (99 × 1))
- 53. The calculation: P(drug user | positive drug test) = 99 / (99 × (1 + 1))
- 54. The calculation: P(drug user | positive drug test) = 1/2
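The whole drug-test calculation can be checked with exact fractions. A quick sketch; the variable names are mine:

```python
from fractions import Fraction

# priors and likelihoods from the slides
p_user = Fraction(1, 100)               # P(drug user): 1% of people take drugs
p_clean = 1 - p_user                    # P(non user)
p_pos_given_user = Fraction(99, 100)    # test is 99% accurate
p_pos_given_clean = Fraction(1, 100)    # 1% false positive rate

# P(B): total probability of a positive test, summing both ways
# you could see the signal
p_positive = p_user * p_pos_given_user + p_clean * p_pos_given_clean

# Bayes' rule: P(drug user | positive drug test)
p_user_given_positive = p_user * p_pos_given_user / p_positive
```

The exact answer comes out to 1/2, matching the slide: a positive result from a "99% accurate" test on a rare condition is only a coin flip.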
- 55. Maths applied to NLP
- 56. Building a spam filter ● Using what we know about Bayes, we're going to build an NLP spam filter ● We'll use n-grams as our features - the number of times we have seen each word ● 1-gram is each word, 2-grams are pairs of words: 2-grams are more accurate but more complex
- 57. 1-grams. Spam: "Dear Sir, Give me your bank account. I will transfer money from my bank account to your bank account. Yours in good faith and honour, Mr P. Minister" (30 words total): bank 3, account 3, your 2, from 1, give 1, dear 1, sir 1, i 1, will 1, transfer 1, money 1, me 1, my 1, to 1. Ham: "Hi, Lovely to see you last night. I'll pay you back for the film - just give me your bank account details. Cheers, Sally x" (28 words total): you 2, the 1, to 1, see 1, hi 1, last 1, night 1, ill 1, pay 1, back 1, for 1, me 1, your 1, bank 1, account 1, details 1.
- 58. 1-grams. bank 3/30: 1 in 10 words is "bank" for spam. "Give me your bank account. I will transfer money from my bank account to your bank account." P(bank,bank,bank|spam) = P(bank) × P(bank) × P(bank) = 1/1000. P(bank,bank,bank|ham) = P(bank) × P(bank) × P(bank) = 1/21,952.
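The 1-gram counting can be reproduced directly. Note that naive whitespace tokenisation gives 28 words for the spam email rather than the slide's 30, so the fractions below come from the computed counts, not the slide's:

```python
from collections import Counter
from fractions import Fraction

def unigram_counts(text):
    """Lower-case, strip punctuation, and count each word (a 1-gram model)."""
    words = [w.strip(".,!?'") for w in text.lower().split()]
    return Counter(w for w in words if w)

spam_text = ("Dear Sir, Give me your bank account. I will transfer money "
             "from my bank account to your bank account. Yours in good "
             "faith and honour, Mr P. Minister")
counts = unigram_counts(spam_text)
total = sum(counts.values())

p_bank = Fraction(counts["bank"], total)
# likelihood of seeing "bank" three times, treating words as independent
p_three_banks = p_bank ** 3
```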
- 59. Smoothing 1-grams. 24 unique words. Count each word as (count(word) + smooth) / (countWords + smooth × uniqueWords). Laplacian smoothing: take a bit of probability away from each of our words to give to words we've not seen before.
- 60. Smoothing 1-grams: P(word) = (count(word) + smooth) / (countWords + smooth × uniqueWords). P(bank) = (1 + 0.1) / (28 + 0.1 × 24) = 1.1 / 30.4. P(sesquipedalian) = 0.1 / 30.4.
- 61. Applied smoothing. P(lovely,film|spam): P(lovely) = (0 + 0.1) / (30 + 24 × 0.1) = 0.1 / 32.4, so P(lovely) × P(film) = (0.1 / 32.4)². P(lovely,film|ham): P(lovely) × P(film) = (1.1 / 30.4)². And (0.1 / 32.4)² < (1.1 / 30.4)².
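Laplace smoothing as given on the slides, applied to a hypothetical count table built to match the slide's ham totals (28 words, 24 unique, "bank" seen once); the specific word counts here are illustrative:

```python
from collections import Counter
from fractions import Fraction

def smoothed_prob(word, counts, k=Fraction(1, 10)):
    """Laplace smoothing: (count(word) + k) / (countWords + k * uniqueWords)."""
    total = sum(counts.values())
    unique = len(counts)
    return (counts.get(word, 0) + k) / (total + k * unique)

# hypothetical count table matching the slide's ham totals:
# 4 words seen twice + 20 words seen once = 28 words, 24 unique
ham_counts = Counter({"you": 2, "your": 2, "me": 2, "to": 2})
for w in ("hi lovely see last night ill pay back for the film just "
          "give bank account details cheers sally x i").split():
    ham_counts[w] = 1
```

With these totals, P(bank) = 1.1 / 30.4 and an unseen word like "sesquipedalian" gets 0.1 / 30.4, exactly as on the slide.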
- 62. Smoothed n-grams with Bayes. P(A|B) = P(A)P(B|A) / P(B). P(spam|words) = P(spam)P(words|spam) / P(words), where P(words) = P(spam)P(words|spam) + P(ham)P(words|ham). We take the product of all the word probabilities as P(words|tag) for both spam and ham, and choose whichever tag has the highest P(tag|words).
- 63. Applied smoothing. Priors: ham(0.5), spam(0.5); posteriors are P(spam)P(words|spam) / P(words) and P(ham)P(words|ham) / P(words). Email one (spam): P(bank,bank,bank|spam) = 8.75e-04 → 0.39 (39%); P(bank,bank,bank|ham) = 4.73e-05 → 0.02 (2%). Email two (ham): P(lovely,film|spam) = 9.52e-06 → 0.004 (0.4%); P(lovely,film|ham) = 1.3e-03 → 0.58 (58%).
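Putting the pieces together, a minimal sketch of the smoothed 1-gram Naive Bayes classifier the slides describe. All names are mine, tokenisation is deliberately crude, and the comparison just picks the larger unnormalised score (the shared P(words) denominator cancels):

```python
from collections import Counter
from fractions import Fraction

def train(texts):
    """Count 1-grams across a list of texts belonging to one tag."""
    return Counter(w.strip(".,!?'").lower() for t in texts for w in t.split())

def smoothed(word, counts, vocab_size, k=Fraction(1, 10)):
    """Laplace-smoothed P(word|tag) over a shared vocabulary."""
    return (counts.get(word, 0) + k) / (sum(counts.values()) + k * vocab_size)

def classify(words, spam_counts, ham_counts, p_spam=Fraction(1, 2)):
    """Naive Bayes: pick the tag maximising P(tag) * prod(P(word|tag))."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    score_spam, score_ham = p_spam, 1 - p_spam
    for w in words:
        score_spam *= smoothed(w, spam_counts, vocab_size)
        score_ham *= smoothed(w, ham_counts, vocab_size)
    return "spam" if score_spam > score_ham else "ham"
```

In production you would sum log-probabilities instead of multiplying, to avoid underflow on long documents; exact fractions sidestep that here at the cost of speed.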
- 64. Summary ● NLP uses features of language to statistically classify, interpret or generate language ● Bayes' rule is a mathematical method for updating your beliefs on evidence ● P(event|signal) = P(event)P(signal|event) / P(signal) ● Smoothed n-grams make a dumb but simple spam filter ● Naïve Bayes shouldn't work: but does
