2013-1 Machine Learning Lecture 03 - Andrew Moore - probabilistic and bayesian analytics

Transcript

  • 1. Aug 25th, 2001. Copyright © 2001, Andrew W. Moore.
    Probabilistic and Bayesian Analytics
    Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University
    www.cs.cmu.edu/~awm    awm@cs.cmu.edu    412-268-7599
    Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
  • 2. Probability
    • The world is a very uncertain place.
    • 30 years of Artificial Intelligence and Database research danced around this fact.
    • And then a few AI researchers decided to use some ideas from the eighteenth century.
  • 3. What we're going to do
    • We will review the fundamentals of probability.
    • It's really going to be worth it.
    • In this lecture, you'll see an example of probabilistic analytics in action: Bayes Classifiers.
  • 4. Discrete Random Variables
    • A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
    • Examples:
      A = The US president in 2023 will be male
      A = You wake up tomorrow with a headache
      A = You have Ebola
  • 5. Probabilities
    • We write P(A) as "the fraction of possible worlds in which A is true".
    • We could at this point spend 2 hours on the philosophy of this.
    • But we won't.
  • 6. Visualizing A
    [Diagram: the event space of all possible worlds, with area 1; worlds in which A is false; worlds in which A is true. P(A) = area of the reddish oval.]
  • 7. The Axioms of Probability
    • 0 <= P(A) <= 1
    • P(True) = 1
    • P(False) = 0
    • P(A or B) = P(A) + P(B) - P(A and B)
    Where do these axioms come from? Were they "discovered"? Answers coming up later.
  • 8. Interpreting the axioms
    • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B)
    The area of A can't get any smaller than 0, and a zero area would mean no world could ever have A true.
  • 9. Interpreting the axioms
    • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B)
    The area of A can't get any bigger than 1, and an area of 1 would mean all worlds will have A true.
  • 10. Interpreting the axioms
    • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B)
    [Venn diagram of events A and B]
  • 11. Interpreting the axioms
    • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B)
    [Venn diagram: P(A or B) covers both circles; P(A and B) is the overlap. Simple addition and subtraction.]
  • 12. These Axioms are Not to be Trifled With
    • There have been attempts to do different methodologies for uncertainty: Fuzzy Logic, Three-valued logic, Dempster-Shafer, Non-monotonic reasoning.
    • But the axioms of probability are the only system with this property: if you gamble using them you can't be unfairly exploited by an opponent using some other system [de Finetti 1931].
  • 13. Theorems from the Axioms
    • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
    • P(A or B) = P(A) + P(B) - P(A and B)
    From these we can prove: P(not A) = P(~A) = 1 - P(A)
    • How?
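    One way to answer the "How?" (not on the original slide), using nothing but the axioms above and the facts that A or ~A = True and A and ~A = False:
      P(A or ~A) = P(A) + P(~A) - P(A and ~A)
      P(True)    = P(A) + P(~A) - P(False)
      1          = P(A) + P(~A) - 0
      so P(~A)   = 1 - P(A)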
  • 14. Side Note
    • I am inflicting these proofs on you for two reasons:
      1. These kinds of manipulations will need to be second nature to you if you use probabilistic analytics in depth.
      2. Suffering is good for you.
  • 15. Another important theorem
    • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
    • P(A or B) = P(A) + P(B) - P(A and B)
    From these we can prove: P(A) = P(A ^ B) + P(A ^ ~B)
    • How?
  • 16. Multivalued Random Variables
    • Suppose A can take on more than 2 values.
    • A is a random variable with arity k if it can take on exactly one value out of {v1, v2, ..., vk}.
    • Thus:
      P(A = vi ^ A = vj) = 0 if i ≠ j
      P(A = v1 or A = v2 or ... or A = vk) = 1
  • 17. An easy fact about Multivalued Random Variables
    • Using the axioms of probability: 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B)
    • And assuming that A obeys: P(A = vi ^ A = vj) = 0 if i ≠ j, and P(A = v1 or A = v2 or ... or A = vk) = 1
    • It's easy to prove that
      P(A = v1 or A = v2 or ... or A = vi) = sum_{j=1..i} P(A = vj)
  • 18. An easy fact about Multivalued Random Variables
    • [Axioms and assumptions as on the previous slide.]
    • And thus we can prove
      sum_{j=1..k} P(A = vj) = 1
  • 19. Another fact about Multivalued Random Variables
    • Using the axioms of probability, and assuming that A obeys the two facts above, it's easy to prove that
      P(B ^ [A = v1 or A = v2 or ... or A = vi]) = sum_{j=1..i} P(B ^ A = vj)
  • 20. Another fact about Multivalued Random Variables
    • [Axioms and assumptions as on the previous slide.]
    • And thus we can prove
      P(B) = sum_{j=1..k} P(B ^ A = vj)
  • 21. Elementary Probability in Pictures
    • P(~A) + P(A) = 1
  • 22. Elementary Probability in Pictures
    • P(B) = P(B ^ A) + P(B ^ ~A)
  • 23. Elementary Probability in Pictures
    • sum_{j=1..k} P(A = vj) = 1
  • 24. Elementary Probability in Pictures
    • P(B) = sum_{j=1..k} P(B ^ A = vj)
  • 25. Conditional Probability
    • P(A|B) = fraction of worlds in which B is true that also have A true.
    H = "Have a headache", F = "Coming down with flu"
    P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
    "Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."
  • 26. Conditional Probability
    H = "Have a headache", F = "Coming down with flu"
    P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
    P(H|F) = fraction of flu-inflicted worlds in which you have a headache
           = (# worlds with flu and headache) / (# worlds with flu)
           = (area of "H and F" region) / (area of "F" region)
           = P(H ^ F) / P(F)
  • 27. Definition of Conditional Probability
    P(A|B) = P(A ^ B) / P(B)
    Corollary: the Chain Rule
    P(A ^ B) = P(A|B) P(B)
  • 28. Probabilistic Inference
    H = "Have a headache", F = "Coming down with flu"
    P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
    One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu."
    Is this reasoning good?
  • 29. Probabilistic Inference
    H = "Have a headache", F = "Coming down with flu"
    P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
    P(F ^ H) = ...
    P(F|H) = ...
  • 30. Another way to understand the intuition
    Thanks to Jahanzeb Sherwani for contributing this explanation: [diagram]
  • 31. What we just did...
    P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A)
    This is Bayes Rule.
    Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
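    The flu/headache numbers above make a quick worked example of Bayes Rule. A minimal Python sketch (not part of the original slides) that checks the inference asked for on slide 29:
      # Numbers from the slides: P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
      p_H = 1 / 10           # P(headache)
      p_F = 1 / 40           # P(flu)
      p_H_given_F = 1 / 2    # P(headache | flu)

      # Chain rule: P(F ^ H) = P(H|F) P(F)
      p_F_and_H = p_H_given_F * p_F          # 0.0125

      # Bayes Rule: P(F|H) = P(H|F) P(F) / P(H)
      p_F_given_H = p_F_and_H / p_H
      print(p_F_given_H)                     # 0.125 -- a 1-in-8 chance, not 50-50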
  • 32. Using Bayes Rule to Gamble
    The "Win" envelope has a dollar and four beads in it. The "Lose" envelope has three beads and no money.
    Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay?
  • 33. Using Bayes Rule to Gamble
    The "Win" envelope has a dollar and four beads in it. The "Lose" envelope has three beads and no money.
    Interesting question: before deciding, you are allowed to see one bead drawn from the envelope.
    Suppose it's black: how much should you pay?
    Suppose it's red: how much should you pay?
  • 34. Calculation...
    [Worked on the slide diagram.]
  • 35. More General Forms of Bayes Rule
    P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]
    P(A|B ^ X) = P(B|A ^ X) P(A ^ X) / P(B ^ X)
  • 36. More General Forms of Bayes Rule
    P(A = vi | B) = P(B | A = vi) P(A = vi) / sum_{k=1..nA} P(B | A = vk) P(A = vk)
  • 37. Useful Easy-to-prove facts
    P(A|B) + P(~A|B) = 1
    sum_{k=1..nA} P(A = vk | B) = 1
  • 38. The Joint Distribution
    Recipe for making a joint distribution of M variables. Example: Boolean variables A, B, C.
  • 39. The Joint Distribution
    Recipe for making a joint distribution of M variables:
    1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
    Example: Boolean variables A, B, C.
      A B C
      0 0 0
      0 0 1
      0 1 0
      0 1 1
      1 0 0
      1 0 1
      1 1 0
      1 1 1
  • 40. The Joint Distribution
    Recipe for making a joint distribution of M variables:
    1. Make a truth table listing all combinations of values of your variables (2^M rows for M Boolean variables).
    2. For each combination of values, say how probable it is.
    Example: Boolean variables A, B, C.
      A B C  Prob
      0 0 0  0.30
      0 0 1  0.05
      0 1 0  0.10
      0 1 1  0.05
      1 0 0  0.05
      1 0 1  0.10
      1 1 0  0.25
      1 1 1  0.10
  • 41. The Joint Distribution
    Recipe for making a joint distribution of M variables:
    1. Make a truth table listing all combinations of values of your variables (2^M rows for M Boolean variables).
    2. For each combination of values, say how probable it is.
    3. If you subscribe to the axioms of probability, those numbers must sum to 1.
    Example: Boolean variables A, B, C (table as above).
    [Venn diagram of A, B, C with the eight probabilities 0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10 shown as areas.]
  • 42. Using the Joint
    Once you have the JD you can ask for the probability of any logical expression E involving your attributes:
    P(E) = sum over rows matching E of P(row)
  • 43. Using the Joint
    P(Poor Male) = 0.4654
    P(E) = sum over rows matching E of P(row)
  • 44. Using the Joint
    P(Poor) = 0.7604
    P(E) = sum over rows matching E of P(row)
  • 45. Inference with the Joint
    P(E1 | E2) = P(E1 ^ E2) / P(E2)
               = (sum over rows matching E1 and E2 of P(row)) / (sum over rows matching E2 of P(row))
  • 46. Inference with the Joint
    P(E1 | E2) = P(E1 ^ E2) / P(E2)
               = (sum over rows matching E1 and E2 of P(row)) / (sum over rows matching E2 of P(row))
    P(Male | Poor) = 0.4654 / 0.7604 = 0.612
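    A minimal Python sketch (not part of the original slides) of both rules, using the toy A, B, C joint from slide 40; the helper names and the example queries are illustrative assumptions:
      # The joint distribution from slide 40: (A, B, C) -> probability
      joint = {
          (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
          (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
      }

      def prob(e):
          """Slide 42: P(E) = sum of P(row) over the rows matching expression E."""
          return sum(p for row, p in joint.items() if e(*row))

      def cond_prob(e1, e2):
          """Slide 45: P(E1 | E2) = P(E1 ^ E2) / P(E2)."""
          return prob(lambda a, b, c: e1(a, b, c) and e2(a, b, c)) / prob(e2)

      print(prob(lambda a, b, c: a and not c))                  # P(A ^ ~C) = 0.05 + 0.25 = 0.30
      print(cond_prob(lambda a, b, c: a, lambda a, b, c: c))    # P(A | C) = 0.20 / 0.30 = 0.667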
  • 47. Inference is a big deal
    • I've got this evidence. What's the chance that this conclusion is true?
    • I've got a sore neck: how likely am I to have meningitis?
    • I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?
  • 48. Inference is a big deal
    • I've got this evidence. What's the chance that this conclusion is true?
    • I've got a sore neck: how likely am I to have meningitis?
    • I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?
  • 49. Inference is a big deal
    • I've got this evidence. What's the chance that this conclusion is true?
    • I've got a sore neck: how likely am I to have meningitis?
    • I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?
    • There's a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis.
  • 50. Where do Joint Distributions come from?
    • Idea One: Expert Humans.
    • Idea Two: Simpler probabilistic facts and some algebra.
    Example: Suppose you knew
      P(A) = 0.7
      P(B|A) = 0.2, P(B|~A) = 0.1
      P(C|A^B) = 0.1, P(C|A^~B) = 0.8, P(C|~A^B) = 0.3, P(C|~A^~B) = 0.1
    Then you can automatically compute the JD using the chain rule:
      P(A=x ^ B=y ^ C=z) = P(C=z | A=x ^ B=y) P(B=y | A=x) P(A=x)
    In another lecture: Bayes Nets, a systematic way to do this.
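    A minimal Python sketch (not part of the original slides) of that chain-rule construction, using exactly the seven numbers on slide 50; the dictionary layout is an illustrative assumption:
      p_A = 0.7
      p_B_given_A  = {True: 0.2, False: 0.1}                   # P(B | A), P(B | ~A)
      p_C_given_AB = {(True, True): 0.1, (True, False): 0.8,   # P(C | A ^ B), P(C | A ^ ~B)
                      (False, True): 0.3, (False, False): 0.1} # P(C | ~A ^ B), P(C | ~A ^ ~B)

      def p(value, prob_true):
          # Probability of a Boolean taking this value, given the probability it is True.
          return prob_true if value else 1.0 - prob_true

      # Chain rule: P(A=x ^ B=y ^ C=z) = P(C=z | A=x ^ B=y) P(B=y | A=x) P(A=x)
      joint = {}
      for a in (False, True):
          for b in (False, True):
              for c in (False, True):
                  joint[(a, b, c)] = p(c, p_C_given_AB[(a, b)]) * p(b, p_B_given_A[a]) * p(a, p_A)

      assert abs(sum(joint.values()) - 1.0) < 1e-9    # a proper joint distribution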
  • 51. Where do Joint Distributions come from?
    • Idea Three: Learn them from data!
    Prepare to see one of the most impressive learning algorithms you'll come across in the entire course...
  • 52. Learning a joint distribution
    Build a JD table for your attributes in which the probabilities are unspecified. Then fill in each row with
      P̂(row) = (# records matching row) / (total number of records)
    [Left: the A, B, C table with all probabilities marked "?"; right: the same table filled in with 0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10.]
    Fraction of all records in which A and B are True but C is False.
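    A minimal Python sketch (not part of the original slides) of that counting rule; the records list is invented for illustration, and any iterable of attribute tuples would do:
      from collections import Counter

      # Hypothetical dataset: each record is a tuple of attribute values (A, B, C).
      records = [(1, 1, 0), (0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1)]

      counts = Counter(records)
      # P̂(row) = (# records matching row) / (total number of records)
      joint_hat = {row: n / len(records) for row, n in counts.items()}
      print(joint_hat[(1, 1, 0)])   # 2/5 = 0.4 in this toy dataset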
  • 53. Example of Learning a Joint
    • This Joint was obtained by learning from three attributes in the UCI "Adult" Census Database [Kohavi 1995].
  • 54. Where are we?
    • We have recalled the fundamentals of probability.
    • We have become content with what JDs are and how to use them.
    • And we even know how to learn JDs from data.
  • 55. Density Estimation
    • Our Joint Distribution learner is our first example of something called Density Estimation.
    • A Density Estimator learns a mapping from a set of attributes to a Probability.
      Input Attributes -> Density Estimator -> Probability
  • 56. Density Estimation
    • Compare it against the two other major kinds of models:
      Input Attributes -> Regressor -> prediction of real-valued output
      Input Attributes -> Density Estimator -> Probability
      Input Attributes -> Classifier -> prediction of categorical output
  • 57. Evaluating Density Estimation
    Test-set criterion for estimating performance on future data*:
      Regressor: test set accuracy
      Density Estimator: ?
      Classifier: test set accuracy
    *See the Decision Tree or Cross Validation lecture for more detail.
  • 58. Evaluating a density estimator
    • Given a record x, a density estimator M can tell you how likely the record is: P̂(x|M).
    • Given a dataset with R records, a density estimator can tell you how likely the dataset is:
      P̂(dataset|M) = P̂(x1 ^ x2 ^ ... ^ xR | M) = prod_{k=1..R} P̂(xk|M)
    (Under the assumption that all records were independently generated from the Density Estimator's JD.)
  • 59. A small dataset: Miles Per Gallon
    From the UCI repository (thanks to Ross Quinlan). 192 training set records.
    [Table of mpg / modelyear / maker values, e.g. "good 75to78 asia", "bad 70to74 america", "bad 75to78 europe", ...]
  • 60. A small dataset: Miles Per Gallon
    From the UCI repository (thanks to Ross Quinlan). 192 training set records. [Same table as the previous slide.]
  • 61. A small dataset: Miles Per Gallon
    From the UCI repository (thanks to Ross Quinlan). 192 training set records.
    P̂(dataset|M) = prod_{k=1..R} P̂(xk|M) = (in this case) 3.4 x 10^-203
  • 62. Log Probabilities
    Since probabilities of datasets get so small we usually use log probabilities:
    log P̂(dataset|M) = log prod_{k=1..R} P̂(xk|M) = sum_{k=1..R} log P̂(xk|M)
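    A minimal Python sketch (not part of the original slides) of that test-set score; the density dict below is the toy joint learned after slide 52, and any dict mapping records to estimated probabilities would work:
      import math

      def log_likelihood(dataset, density):
          """log P̂(dataset|M) = sum over records of log P̂(x|M)."""
          total = 0.0
          for x in dataset:
              p = density.get(x, 0.0)
              # A record the estimator gives probability 0 drives the score to -infinity,
              # which is exactly the overfitting problem discussed on the next slides.
              total += math.log(p) if p > 0.0 else float("-inf")
          return total

      density = {(1, 1, 0): 0.4, (0, 0, 0): 0.2, (0, 0, 1): 0.2, (1, 0, 1): 0.2}
      print(log_likelihood([(1, 1, 0), (0, 1, 1)], density))   # -inf: (0, 1, 1) was never seen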
  • 63. A small dataset: Miles Per Gallon
    From the UCI repository (thanks to Ross Quinlan). 192 training set records.
    log P̂(dataset|M) = sum_{k=1..R} log P̂(xk|M) = (in this case) -466.19
  • 64. Summary: The Good News
    • We have a way to learn a Density Estimator from data.
    • Density estimators can do many good things...
      • Can sort the records by probability, and thus spot weird records (anomaly detection).
      • Can do inference: P(E1|E2). Automatic Doctor / Help Desk etc.
      • Ingredient for Bayes Classifiers (see later).
  • 65. Summary: The Bad News
    • Density estimation by directly learning the joint is trivial, mindless and dangerous.
  • 66. Using a test set
    An independent test set with 196 cars has a worse log likelihood (actually it's a billion quintillion quintillion quintillion quintillion times less likely)...
    Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!
  • 67. Overfitting Density Estimators
    If this ever happens, it means there are certain combinations that we learn are impossible:
    log P̂(testset|M) = sum_{k=1..R} log P̂(xk|M) = -infinity if P̂(xk|M) = 0 for any k
  • 68. Using a test set
    The only reason that our test set didn't score -infinity is that my code is hard-wired to always predict a probability of at least one in 10^20.
    We need Density Estimators that are less prone to overfitting.
  • 69. Naïve Density Estimation
    The problem with the Joint Estimator is that it just mirrors the training data. We need something which generalizes more usefully.
    The naïve model generalizes strongly: assume that each attribute is distributed independently of any of the other attributes.
  • 70. Independently Distributed Data
    • Let x[i] denote the i'th field of record x.
    • The independently distributed assumption says that for any i, v, u1, u2, ..., u_{i-1}, u_{i+1}, ..., uM:
      P(x[i] = v | x[1] = u1, x[2] = u2, ..., x[i-1] = u_{i-1}, x[i+1] = u_{i+1}, ..., x[M] = uM) = P(x[i] = v)
    • Or in other words, x[i] is independent of {x[1], x[2], ..., x[i-1], x[i+1], ..., x[M]}.
    • This is often written as x[i] ⊥ {x[1], x[2], ..., x[i-1], x[i+1], ..., x[M]}.
  • 71. A note about independence
    • Assume A and B are Boolean Random Variables. Then "A and B are independent" if and only if P(A|B) = P(A).
    • "A and B are independent" is often notated as A ⊥ B.
  • 72. Independence Theorems
    • Assume P(A|B) = P(A). Then P(A ^ B) = P(A) P(B).
    • Assume P(A|B) = P(A). Then P(B|A) = P(B).
  • 73. Independence Theorems
    • Assume P(A|B) = P(A). Then P(~A|B) = P(~A).
    • Assume P(A|B) = P(A). Then P(A|~B) = P(A).
  • 74. Multivalued Independence
    For multivalued Random Variables A and B, A ⊥ B if and only if
      for all u, v: P(A = u | B = v) = P(A = u)
    from which you can then prove things like...
      for all u, v: P(A = u ^ B = v) = P(A = u) P(B = v)
      for all u, v: P(B = v | A = u) = P(B = v)
  • 75. Back to Naïve Density Estimation
    • Let x[i] denote the i'th field of record x.
    • Naïve DE assumes x[i] is independent of {x[1], x[2], ..., x[i-1], x[i+1], ..., x[M]}.
    • Example: Suppose that each record is generated by randomly shaking a green die and a red die.
      Dataset 1: A = red value, B = green value
      Dataset 2: A = red value, B = sum of values
      Dataset 3: A = sum of values, B = difference of values
    • Which of these datasets violates the naïve assumption?
  • 76. Using the Naïve Distribution
    • Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
    • Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)?
  • 77. Using the Naïve Distribution
    • Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
    • Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)?
      = P(A | ~B ^ C ^ ~D) P(~B ^ C ^ ~D)
      = P(A) P(~B ^ C ^ ~D)
      = P(A) P(~B | C ^ ~D) P(C ^ ~D)
      = P(A) P(~B) P(C ^ ~D)
      = P(A) P(~B) P(C | ~D) P(~D)
      = P(A) P(~B) P(C) P(~D)
  • 78. Naïve Distribution General Case
    • Suppose x[1], x[2], ..., x[M] are independently distributed. Then
      P(x[1] = u1, x[2] = u2, ..., x[M] = uM) = prod_{k=1..M} P(x[k] = uk)
    • So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
    • So we can do any inference.
    • But how do we learn a Naïve Density Estimator?
  • 79. Learning a Naïve Density Estimator
      P̂(x[i] = u) = (# records in which x[i] = u) / (total number of records)
    Another trivial learning algorithm!
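    A minimal Python sketch (not part of the original slides) of slides 78-79: learn the per-attribute marginals by counting, then rebuild any joint row as a product. The dataset and helper names are invented for illustration:
      from collections import Counter

      # Hypothetical dataset of Boolean records (x[1], x[2], x[3]).
      data = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0), (1, 0, 1)]
      M = len(data[0])

      # Slide 79: P̂(x[i] = u) = (# records in which x[i] = u) / (total number of records)
      marginals = [Counter(rec[i] for rec in data) for i in range(M)]
      def p_hat(i, u):
          return marginals[i][u] / len(data)

      # Slide 78: P(x[1]=u1, ..., x[M]=uM) = product of the per-attribute probabilities
      def naive_joint(row):
          result = 1.0
          for i, u in enumerate(row):
              result *= p_hat(i, u)
          return result

      print(naive_joint((1, 0, 1)))   # 0.8 * 0.8 * 0.8 = 0.512 for this toy dataset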
  • 80. Contrast
      Joint DE                                     | Naïve DE
      Can model anything                           | Can model only very boring distributions
      No problem to model "C is a noisy copy of A" | Outside Naïve's scope
      Given 100 records and more than 6 Boolean    | Given 100 records and 10,000 multivalued
      attributes will screw up badly               | attributes will be fine
  • 81. Empirical Results: "Hopeless"
    The "hopeless" dataset consists of 40,000 records and 21 Boolean attributes called a, b, c, ..., u. Each attribute in each record is generated 50-50 randomly as 0 or 1.
    Despite the vast amount of data, "Joint" overfits hopelessly and does much worse.
    [Chart: average test set log probability during 10 folds of k-fold cross-validation*]
    *Described in a future Andrew lecture.
  • 82. Empirical Results: "Logical"
    The "logical" dataset consists of 40,000 records and 4 Boolean attributes called a, b, c, d where a, b, c are generated 50-50 randomly as 0 or 1. D = A ^ ~C, except that in 10% of records it is flipped.
    [The DE learned by "Joint"; the DE learned by "Naive"]
  • 83. Empirical Results: "Logical"
    [Same dataset and figures as the previous slide.]
  • 84. Empirical Results: "MPG"
    The "MPG" dataset consists of 392 records and 8 attributes.
    [A tiny part of the DE learned by "Joint"; the DE learned by "Naive"]
  • 85. Empirical Results: "MPG"
    [Same dataset and figures as the previous slide.]
  • 86. Empirical Results: "Weight vs. MPG"
    Suppose we train only from the "Weight" and "MPG" attributes.
    [The DE learned by "Joint"; the DE learned by "Naive"]
  • 87. Empirical Results: "Weight vs. MPG"
    [Same dataset and figures as the previous slide.]
  • 88. "Weight vs. MPG": The best that Naïve can do
    [The DE learned by "Joint"; the DE learned by "Naive"]
  • 89. Reminder: The Good News
    • We have two ways to learn a Density Estimator from data.
    • *In other lectures we'll see vastly more impressive Density Estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more).
    • Density estimators can do many good things...
      • Anomaly detection.
      • Can do inference: P(E1|E2). Automatic Doctor / Help Desk etc.
      • Ingredient for Bayes Classifiers.
  • 90. Bayes Classifiers
    • A formidable and sworn enemy of decision trees.
    [Input Attributes -> Classifier -> prediction of categorical output; DT vs. BC]
  • 91. How to build a Bayes Classifier
    • Assume you want to predict output Y which has arity nY and values v1, v2, ..., v_nY.
    • Assume there are m input attributes called X1, X2, ..., Xm.
    • Break dataset into nY smaller datasets called DS1, DS2, ..., DS_nY.
    • Define DSi = records in which Y = vi.
    • For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
  • 92. How to build a Bayes Classifier
    • [As on the previous slide.]
    • Mi estimates P(X1, X2, ..., Xm | Y = vi).
  • 93. How to build a Bayes Classifier
    • [As on the previous slide.]
    • Idea: when a new set of input values (X1 = u1, X2 = u2, ..., Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, ..., Xm | Y = vi) most likely:
      Y_predict = argmax_v P(X1 = u1 ... Xm = um | Y = v)
    Is this a good idea?
  • 94. How to build a Bayes Classifier
    • [As on the previous slide.]
      Y_predict = argmax_v P(X1 = u1 ... Xm = um | Y = v)
    Is this a good idea? This is a Maximum Likelihood classifier. It can get silly if some Ys are very unlikely.
  • 95. How to build a Bayes Classifier
    • [As on the previous slide, but] predict the value of Y that makes P(Y = vi | X1, X2, ..., Xm) most likely:
      Y_predict = argmax_v P(Y = v | X1 = u1 ... Xm = um)
    Is this a good idea? Much Better Idea.
  • 96. Terminology
    • MLE (Maximum Likelihood Estimator): Y_predict = argmax_v P(X1 = u1 ... Xm = um | Y = v)
    • MAP (Maximum A-Posteriori Estimator): Y_predict = argmax_v P(Y = v | X1 = u1 ... Xm = um)
  • 97. Getting what we need
    Y_predict = argmax_v P(Y = v | X1 = u1 ... Xm = um)
  • 98. Getting a posterior probability
    P(Y = v | X1 = u1 ... Xm = um)
      = P(X1 = u1 ... Xm = um | Y = v) P(Y = v) / P(X1 = u1 ... Xm = um)
      = P(X1 = u1 ... Xm = um | Y = v) P(Y = v) / sum_{j=1..nY} P(X1 = u1 ... Xm = um | Y = vj) P(Y = vj)
  • 99. Bayes Classifiers in a nutshell
    1. Learn the distribution over inputs for each value of Y.
    2. This gives P(X1, X2, ..., Xm | Y = vi).
    3. Estimate P(Y = vi) as the fraction of records with Y = vi.
    4. For a new prediction:
       Y_predict = argmax_v P(Y = v | X1 = u1 ... Xm = um)
                 = argmax_v P(X1 = u1 ... Xm = um | Y = v) P(Y = v)
  • 100. Bayes Classifiers in a nutshell
    [Steps 1-4 as on the previous slide.] We can use our favorite Density Estimator here. Right now we have two options:
      • Joint Density Estimator
      • Naïve Density Estimator
  • 101. Joint Density Bayes Classifier
    Y_predict = argmax_v P(X1 = u1 ... Xm = um | Y = v) P(Y = v)
    In the case of the joint Bayes Classifier this degenerates to a very simple rule:
    Y_predict = the most common value of Y among records in which X1 = u1, X2 = u2, ..., Xm = um.
    Note that if no records have the exact set of inputs X1 = u1, X2 = u2, ..., Xm = um, then P(X1, X2, ..., Xm | Y = vi) = 0 for all values of Y. In that case we just have to guess Y's value.
  • 102. Joint BC Results: "Logical"
    The "logical" dataset consists of 40,000 records and 4 Boolean attributes called a, b, c, d where a, b, c are generated 50-50 randomly as 0 or 1. D = A ^ ~C, except that in 10% of records it is flipped.
    [The Classifier learned by "Joint BC"]
  • 103. Joint BC Results: "All Irrelevant"
    The "all irrelevant" dataset consists of 40,000 records and 15 Boolean attributes called a, b, c, d, ..., o where a, b, c are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with prob 0.25.
  • 104. Naïve Bayes Classifier
    Y_predict = argmax_v P(X1 = u1 ... Xm = um | Y = v) P(Y = v)
    In the case of the naïve Bayes Classifier this can be simplified:
      Y_predict = argmax_v P(Y = v) prod_{j=1..m} P(Xj = uj | Y = v)
  • 105. Naïve Bayes Classifier
    Y_predict = argmax_v P(Y = v) prod_{j=1..m} P(Xj = uj | Y = v)
    Technical Hint: if you have 10,000 input attributes that product will underflow in floating point math. You should use logs:
      Y_predict = argmax_v [ log P(Y = v) + sum_{j=1..m} log P(Xj = uj | Y = v) ]
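    A minimal Python sketch (not part of the original slides) of the naïve Bayes classifier on slide 105, counting from a hypothetical list of (inputs, label) pairs and scoring in log space as the hint suggests; the function names, toy data and the tiny probability floor for unseen values are illustrative assumptions:
      import math
      from collections import Counter, defaultdict

      def train_naive_bc(examples):
          """examples: list of (input_tuple, label). Returns (priors, per-class marginals)."""
          labels = Counter(lbl for _, lbl in examples)
          priors = {lbl: n / len(examples) for lbl, n in labels.items()}
          marg = defaultdict(Counter)               # (label, attribute index) -> Counter of values
          for x, lbl in examples:
              for j, u in enumerate(x):
                  marg[(lbl, j)][u] += 1
          cond = {key: {u: n / sum(cnt.values()) for u, n in cnt.items()} for key, cnt in marg.items()}
          return priors, cond

      def predict(x, priors, cond, floor=1e-20):
          """argmax_v [ log P(Y=v) + sum_j log P(Xj=uj | Y=v) ], with a floor for unseen values."""
          best, best_score = None, float("-inf")
          for v, prior in priors.items():
              score = math.log(prior)
              for j, u in enumerate(x):
                  score += math.log(cond[(v, j)].get(u, floor))
              if score > best_score:
                  best, best_score = v, score
          return best

      # Toy usage: the label is (roughly) a AND (NOT c), echoing the "logical" dataset idea.
      data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 0, 1), 0), ((1, 0, 1), 0), ((0, 1, 0), 0)]
      priors, cond = train_naive_bc(data)
      print(predict((1, 1, 0), priors, cond))   # predicts 1 for this toy data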
  • 106. BC Results: "XOR"
    The "XOR" dataset consists of 40,000 records and 2 Boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.
    [The Classifier learned by "Naive BC"; the Classifier learned by "Joint BC"]
  • 107. Naive BC Results: "Logical"
    The "logical" dataset consists of 40,000 records and 4 Boolean attributes called a, b, c, d where a, b, c are generated 50-50 randomly as 0 or 1. D = A ^ ~C, except that in 10% of records it is flipped.
    [The Classifier learned by "Naive BC"]
  • 108. Naive BC Results: "Logical"
    [Same dataset as the previous slide; the Classifier learned by "Joint BC".]
    This result surprised Andrew until he had thought about it a little.
  • 109. Naïve BC Results: "All Irrelevant"
    The "all irrelevant" dataset consists of 40,000 records and 15 Boolean attributes called a, b, c, d, ..., o where a, b, c are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with prob 0.25.
    [The Classifier learned by "Naive BC"]
  • 110. BC Results: "MPG", 392 records
    [The Classifier learned by "Naive BC"]
  • 111. BC Results: "MPG", 40 records
  • 112. More Facts About Bayes Classifiers
    • Many other density estimators can be slotted in*.
    • Density estimation can be performed with real-valued inputs*.
    • Bayes Classifiers can be built with real-valued inputs*.
    • Rather Technical Complaint: Bayes Classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on*.
    • Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet Prior") can help*.
    • Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!
    *See future Andrew Lectures.
  • 113. What you should know
    • Probability: fundamentals of Probability and Bayes Rule; what a Joint Distribution is; how to do inference (i.e. P(E1|E2)) once you have a JD.
    • Density Estimation: what DE is and what it is good for; how to learn a Joint DE; how to learn a naïve DE.
  • 114. What you should know
    • Bayes Classifiers: how to build one; how to predict with a BC; the contrast between naïve and joint BCs.
  • 115. Interesting Questions
    • Suppose you were evaluating NaiveBC, JointBC, and Decision Trees.
      • Invent a problem where only NaiveBC would do well.
      • Invent a problem where only Dtree would do well.
      • Invent a problem where only JointBC would do well.
      • Invent a problem where only NaiveBC would do poorly.
      • Invent a problem where only Dtree would do poorly.
      • Invent a problem where only JointBC would do poorly.