2013-1 Machine Learning Lecture 03 - Andrew Moore - Bayes Nets for Representing and Reasoning About Uncertainty
Presentation Transcript

• Slide 1: Bayes Nets for representing and reasoning about uncertainty
  Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University
  www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599
  Oct 15th, 2001. Copyright © 2001, Andrew W. Moore.
  Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
• Slide 2: What we'll discuss
  • Recall the numerous and dramatic benefits of Joint Distributions for describing uncertain worlds
  • Reel with terror at the problem with using Joint Distributions
  • Discover how Bayes Net methodology allows us to build Joint Distributions in manageable chunks
  • Discover there's still a lurking problem…
  • …Start to solve that problem
• Slide 3: Why this matters
  • In Andrew's opinion, the most important technology in the Machine Learning / AI field to have emerged in the last 10 years.
  • A clean, clear, manageable language and methodology for expressing what you're certain and uncertain about
  • Already, many practical applications in medicine, factories, helpdesks:
    P(this problem | these symptoms)
    anomalousness of this observation
    choosing next diagnostic test | these observations
• Slide 4: Why this matters (continued)
  Same as Slide 3, with the three applications labelled in the figure: Inference, Anomaly Detection, and Active Data Collection.
• Slide 5: Ways to deal with Uncertainty
  • Three-valued logic: True / False / Maybe
  • Fuzzy logic (truth values between 0 and 1)
  • Non-monotonic reasoning (especially focused on Penguin informatics)
  • Dempster-Shafer theory (and an extension known as quasi-Bayesian theory)
  • Possibilistic Logic
  • Probability
• Slide 6: Discrete Random Variables
  • A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
  • Examples:
    A = The US president in 2023 will be male
    A = You wake up tomorrow with a headache
    A = You have Ebola
• Slide 7: Probabilities
  • We write P(A) as "the fraction of possible worlds in which A is true"
  • We could at this point spend 2 hours on the philosophy of this.
  • But we won't.
• Slide 8: Visualizing A
  [Diagram] The event space of all possible worlds has total area 1. It is divided into worlds in which A is true and worlds in which A is false; P(A) = area of the region (the reddish oval) in which A is true.
• Slide 9: Interpreting the axioms
  • 0 <= P(A) <= 1
  • P(True) = 1
  • P(False) = 0
  • P(A or B) = P(A) + P(B) - P(A and B)
  The area of A can't get any smaller than 0, and a zero area would mean no world could ever have A true.
• Slide 10: Interpreting the axioms (continued)
  The area of A can't get any bigger than 1, and an area of 1 would mean all worlds will have A true.
• Slide 11: Interpreting the axioms (continued)
  [Diagram] Two overlapping regions A and B in the event space.
• Slide 12: Interpreting the axioms (continued)
  [Diagram] P(A or B) is the area of the union of A and B, and P(A and B) is the area of their overlap: simple addition and subtraction.
• Slide 13: These Axioms are Not to be Trifled With
  • There have been attempts to do different methodologies for uncertainty: Fuzzy Logic, Three-valued logic, Dempster-Shafer, Non-monotonic reasoning
  • But the axioms of probability are the only system with this property: if you gamble using them you can't be unfairly exploited by an opponent using some other system [de Finetti 1931]
• Slide 14: Theorems from the Axioms
  • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
  • P(A or B) = P(A) + P(B) - P(A and B)
  From these we can prove: P(not A) = P(~A) = 1 - P(A)
  • How?
• Slide 15: Side Note
  I am inflicting these proofs on you for two reasons:
  1. These kinds of manipulations will need to be second nature to you if you use probabilistic analytics in depth
  2. Suffering is good for you
• Slide 16: Another important theorem
  • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
  • P(A or B) = P(A) + P(B) - P(A and B)
  From these we can prove: P(A) = P(A ^ B) + P(A ^ ~B)
  • How?
• Slide 17: Conditional Probability
  • P(A|B) = Fraction of worlds in which B is true that also have A true
  H = "Have a headache", F = "Coming down with Flu"
  P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
  "Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."
• Slide 18: Conditional Probability (continued)
  P(H|F) = Fraction of flu-inflicted worlds in which you have a headache
         = (# worlds with flu and headache) / (# worlds with flu)
         = (Area of "H and F" region) / (Area of "F" region)
         = P(H ^ F) / P(F)
• Slide 19: Definition of Conditional Probability
  P(A|B) = P(A ^ B) / P(B)
  Corollary, the Chain Rule: P(A ^ B) = P(A|B) P(B)
  (A short worked example follows this slide.)
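For instance, with the headache/flu numbers from Slide 17, the chain rule gives P(H ^ F) = P(H|F) P(F) = (1/2)(1/40) = 1/80, and Bayes Rule (next slide) then gives P(F|H) = P(H ^ F) / P(H) = (1/80) / (1/10) = 1/8.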
• Slide 20: Bayes Rule
  P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A)
  This is Bayes Rule.
  Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
• Slide 21: Using Bayes Rule to Gamble
  The "Win" envelope has a dollar and four beads in it ($1.00). The "Lose" envelope has three beads and no money. (The bead colors, shown in the figure, are a mix of red and black: R R B B R B B.)
  Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay?
• Slide 22: Using Bayes Rule to Gamble (continued)
  Interesting question: before deciding, you are allowed to see one bead drawn from the envelope.
  Suppose it's black: how much should you pay? Suppose it's red: how much should you pay?
• Slide 23: Calculation…
  (Worked through in lecture; a sketch of the calculation follows this slide.)
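A minimal Python sketch of the envelope calculation with Bayes Rule. The slide leaves the per-envelope bead colors to the picture, so the split used below (Win: 2 red + 2 black, Lose: 1 red + 2 black) is only a hypothetical illustration; substitute the actual counts from the figure.

```python
# Hypothetical bead counts -- the true split is shown in the slide's picture.
win_beads  = {'red': 2, 'black': 2}   # the "Win" envelope also holds $1.00
lose_beads = {'red': 1, 'black': 2}   # the "Lose" envelope holds no money

def p_win_given_bead(color):
    """Posterior P(Win | observed bead color) via Bayes Rule, with priors of 0.5 each."""
    p_color_win  = win_beads[color]  / sum(win_beads.values())
    p_color_lose = lose_beads[color] / sum(lose_beads.values())
    return 0.5 * p_color_win / (0.5 * p_color_win + 0.5 * p_color_lose)

# A fair price is the expected winnings: $1.00 * P(Win | observed bead).
for color in ('black', 'red'):
    print(color, p_win_given_bead(color))
```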
• Slide 24: Multivalued Random Variables
  • Suppose A can take on more than 2 values.
  • A is a random variable with arity k if it can take on exactly one value out of {v1, v2, .. vk}
  • Thus:
    P(A=vi ^ A=vj) = 0 if i ≠ j
    P(A=v1 or A=v2 or ... or A=vk) = 1
• Slide 25: An easy fact about Multivalued Random Variables
  • Using the axioms of probability (0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B))
  • And assuming that A obeys the two conditions above
  • It's easy to prove that
    P(A=v1 or A=v2 or ... or A=vi) = sum for j = 1..i of P(A=vj)
• Slide 26: An easy fact about Multivalued Random Variables (continued)
  • And thus we can prove
    sum for j = 1..k of P(A=vj) = 1
• Slide 27: Another fact about Multivalued Random Variables
  • Using the axioms of probability and the same two conditions on A, it's easy to prove that
    P(B ^ [A=v1 or A=v2 or ... or A=vi]) = sum for j = 1..i of P(B ^ A=vj)
• Slide 28: Another fact about Multivalued Random Variables (continued)
  • And thus we can prove
    P(B) = sum for j = 1..k of P(B ^ A=vj)
• Slide 29: More General Forms of Bayes Rule
  P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]
  P(A|B ^ X) = P(B|A ^ X) P(A ^ X) / P(B ^ X)
• Slide 30: More General Forms of Bayes Rule
  P(A=vi | B) = P(B | A=vi) P(A=vi) / [ sum for k = 1..nA of P(B | A=vk) P(A=vk) ]
• Slide 31: Useful Easy-to-prove facts
  P(A|B) + P(~A|B) = 1
  sum for k = 1..nA of P(A=vk | B) = 1
• Slide 32: The Joint Distribution
  Recipe for making a joint distribution of M variables. Example: Boolean variables A, B, C.
• Slide 33: The Joint Distribution (continued)
  1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
  For the example A, B, C this gives the eight rows 000, 001, 010, 011, 100, 101, 110, 111.
• Slide 34: The Joint Distribution (continued)
  2. For each combination of values, say how probable it is.
    A B C | Prob
    0 0 0 | 0.30
    0 0 1 | 0.05
    0 1 0 | 0.10
    0 1 1 | 0.05
    1 0 0 | 0.05
    1 0 1 | 0.10
    1 1 0 | 0.25
    1 1 1 | 0.10
• Slide 35: The Joint Distribution (continued)
  3. If you subscribe to the axioms of probability, those numbers must sum to 1.
  [Diagram] A Venn-style picture of A, B and C whose eight regions carry the same eight probabilities as areas.
• Slide 36: Using the Joint
  Once you have the JD you can ask for the probability of any logical expression involving your attributes:
  P(E) = sum over rows matching E of P(row)
• Slide 37: Using the Joint (continued)
  Example (from the joint distribution pictured on the slide): P(Poor ^ Male) = 0.4654
• Slide 38: Using the Joint (continued)
  P(Poor) = 0.7604
• Slide 39: Inference with the Joint
  P(E1 | E2) = P(E1 ^ E2) / P(E2)
             = [sum over rows matching E1 and E2 of P(row)] / [sum over rows matching E2 of P(row)]
• Slide 40: Inference with the Joint (continued)
  P(Male | Poor) = 0.4654 / 0.7604 = 0.612
  (A code sketch of this row-summing recipe follows this slide.)
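The row-summing recipe on Slides 36-40 is easy to mechanize. Below is a minimal Python sketch (the dictionary representation and helper names are mine, not the slides') that stores the A, B, C joint from Slide 34 and answers P(E1 | E2) by summing matching rows.

```python
# The A, B, C joint distribution from Slide 34: each row (A, B, C) maps to its probability.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(matches):
    """P(E): sum the probabilities of all rows for which matches(row) is true."""
    return sum(p for row, p in joint.items() if matches(row))

def conditional(matches1, matches2):
    """P(E1 | E2) = P(E1 ^ E2) / P(E2), both computed by summing matching rows."""
    return prob(lambda r: matches1(r) and matches2(r)) / prob(matches2)

print(prob(lambda r: r[0] == 1))                                  # P(A) = 0.50
print(conditional(lambda r: r[0] == 1, lambda r: r[1] or r[2]))   # P(A | B or C)
```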
• Slide 41: Joint distributions
  • Good news: once you have a joint distribution, you can ask important questions about stuff that involves a lot of uncertainty.
  • Bad news: impossible to create for more than about ten attributes because there are so many numbers needed when you build the damn thing.
• Slide 42: Using fewer numbers
  Suppose there are two events:
  • M: Manuela teaches the class (otherwise it's Andrew)
  • S: It is sunny
  The joint p.d.f. for these events contains four entries. If we want to build the joint p.d.f. we'll have to invent those four numbers. OR WILL WE??
  • We don't have to specify bottom-level conjunctive events such as P(~M ^ S) IF…
  • …instead it may sometimes be more convenient for us to specify things like P(M), P(S).
  But just P(M) and P(S) don't derive the joint distribution. So you can't answer all questions.
• Slide 43: Using fewer numbers (identical to Slide 42)
• Slide 44: Independence
  "The sunshine levels do not depend on and do not influence who is teaching."
  This can be specified very simply: P(S | M) = P(S)
  This is a powerful statement!
  It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.
• Slide 45: Independence (continued)
  From P(S | M) = P(S), the rules of probability imply (can you prove these?):
  • P(~S | M) = P(~S)
  • P(M | S) = P(M)
  • P(M ^ S) = P(M) P(S)
  • P(~M ^ S) = P(~M) P(S), P(M ^ ~S) = P(M) P(~S), P(~M ^ ~S) = P(~M) P(~S)
• Slide 46: Independence (continued)
  And in general: P(M=u ^ S=v) = P(M=u) P(S=v) for each of the four combinations of u = True/False, v = True/False.
• Slide 47: Independence (continued)
  We've stated: P(M) = 0.6, P(S) = 0.3, P(S | M) = P(S)
  From these statements, we can derive the full joint pdf (the M, S, Prob table on the slide is left to fill in; it is worked out below).
  And since we now have the joint pdf, we can make any queries we like.
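Filling in the slide's table using P(M=u ^ S=v) = P(M=u) P(S=v):

  M S | Prob
  T T | 0.6 * 0.3 = 0.18
  T F | 0.6 * 0.7 = 0.42
  F T | 0.4 * 0.3 = 0.12
  F F | 0.4 * 0.7 = 0.28

The four entries sum to 1, as the axioms require.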
• Slide 48: A more interesting case
  • M: Manuela teaches the class
  • S: It is sunny
  • L: The lecturer arrives slightly late.
  Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
• Slide 49: A more interesting case (continued)
  Let's begin with writing down knowledge we're happy about:
  P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6
  Lateness is not independent of the weather and is not independent of the lecturer.
• Slide 50: A more interesting case (continued)
  We already know the joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False.
• Slide 51: A more interesting case (continued)
  P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6
  P(L | M ^ S) = 0.05
  P(L | M ^ ~S) = 0.1
  P(L | ~M ^ S) = 0.1
  P(L | ~M ^ ~S) = 0.2
  Now we can derive a full joint p.d.f. with a "mere" six numbers instead of seven*
  (*Savings are larger for larger numbers of variables.)
• Slide 52: A more interesting case (continued)
  Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False. (One answer is sketched below.)
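One way to answer the slide's question (the same factorization is derived step by step on Slide 71): apply the chain rule twice and then use the independence of M and S, giving P(L=x ^ M=y ^ S=z) = P(L=x | M=y ^ S=z) P(M=y) P(S=z). For example, P(L ^ ~M ^ S) = P(L | ~M ^ S) P(~M) P(S) = 0.1 * 0.4 * 0.3 = 0.012.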
• Slide 53: A bit of notation
  The same seven statements, drawn as a diagram: nodes S and M with no link between them, and arrows from S and M into a node L.
  Node S carries P(S) = 0.3, node M carries P(M) = 0.6, and node L carries the table
  P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2
• Slide 54: A bit of notation (continued)
  Read the absence of an arrow between S and M to mean "it would not help me predict M if I knew the value of S".
  Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S.
  (This kind of stuff will be thoroughly formalized later.)
• Slide 55: An even cuter trick
  Suppose we have these three events:
  • M: Lecture taught by Manuela
  • L: Lecturer arrives late
  • R: Lecture concerns robots
  Suppose:
  • Andrew has a higher chance of being late than Manuela.
  • Andrew has a higher chance of giving robotics lectures.
  What kind of independence can we find? How about:
  • P(L | M) = P(L)?
  • P(R | M) = P(R)?
  • P(L | R) = P(L)?
• Slide 56: Conditional independence
  Once you know who the lecturer is, then whether they arrive late doesn't affect whether the lecture concerns robots:
  P(R | M, L) = P(R | M) and P(R | ~M, L) = P(R | ~M)
  We express this in the following way: "R and L are conditionally independent given M"
  …which is also notated by the diagram M → L, M → R. Given knowledge of M, knowing anything else in the diagram won't help us with L, etc.
• Slide 57: Conditional Independence formalized
  R and L are conditionally independent given M if for all x, y, z in {T, F}:
  P(R=x | M=y ^ L=z) = P(R=x | M=y)
  More generally: let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
  P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)
• Slide 58: Example
  "Shoe-size is conditionally independent of Glove-size given height, weight and age" means:
  for all s, g, h, w, a:
  P(ShoeSize=s | Height=h, Weight=w, Age=a) = P(ShoeSize=s | Height=h, Weight=w, Age=a, GloveSize=g)
• Slide 59: Example (continued)
  It does not mean:
  for all s, g, h:
  P(ShoeSize=s | Height=h) = P(ShoeSize=s | Height=h, GloveSize=g)
• Slide 60: Conditional independence
  [Diagram: M with arrows to L and R]
  We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L | M) and P(L | ~M) and know we've fully specified L's behavior. Ditto for R.
  P(M) = 0.6
  P(L | M) = 0.085, P(L | ~M) = 0.17
  P(R | M) = 0.3, P(R | ~M) = 0.6
  "R and L conditionally independent given M"
• Slide 61: Conditional independence (continued)
  Conditional Independence: P(R | M, L) = P(R | M), P(R | ~M, L) = P(R | ~M)
  Again, we can obtain any member of the joint prob dist that we desire:
  P(L=x ^ R=y ^ M=z) = ?  (worked out below)
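Using the chain rule together with the stated conditional independence, P(L=x ^ R=y ^ M=z) = P(L=x | M=z) P(R=y | M=z) P(M=z). For instance, P(L ^ R ^ M) = 0.085 * 0.3 * 0.6 = 0.0153.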
• Slide 62: Assume five variables
  T: The lecture started by 10:35
  L: The lecturer arrives late
  R: The lecture concerns robots
  M: The lecturer is Manuela
  S: It is sunny
  • T is only directly influenced by L (i.e. T is conditionally independent of R, M, S given L)
  • L is only directly influenced by M and S (i.e. L is conditionally independent of R given M & S)
  • R is only directly influenced by M (i.e. R is conditionally independent of L, S given M)
  • M and S are independent
• Slide 63: Making a Bayes net
  Step One: add variables. Just choose the variables you'd like to be included in the net.
  (The five variables T, L, R, M, S from Slide 62, drawn as unconnected nodes.)
• Slide 64: Making a Bayes net (continued)
  Step Two: add links.
  • The link structure must be acyclic.
  • If node X is given parents Q1, Q2, .. Qn, you are promising that any variable that's a non-descendant of X is conditionally independent of X given {Q1, Q2, .. Qn}
  (The links added here: S → L, M → L, M → R, L → T.)
• Slide 65: Making a Bayes net (continued)
  Step Three: add a probability table for each node.
  • The table for node X must list P(X | Parent Values) for each possible combination of parent values.
  P(S) = 0.3, P(M) = 0.6
  P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2
  P(R | M) = 0.3, P(R | ~M) = 0.6
  P(T | L) = 0.3, P(T | ~L) = 0.8
• Slide 66: Making a Bayes net (continued)
  • Two unconnected variables may still be correlated
  • Each node is conditionally independent of all non-descendants in the tree, given its parents.
  • You can deduce many other conditional independence relations from a Bayes net. See the next lecture.
• Slide 67: Bayes Nets Formalized
  A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V, E where:
  • V is a set of vertices.
  • E is a set of directed edges joining vertices. No loops of any length are allowed.
  Each vertex in V contains the following information:
  • The name of a random variable
  • A probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.
• Slide 68: Building a Bayes Net
  1. Choose a set of relevant variables.
  2. Choose an ordering for them.
  3. Assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.)
  4. For i = 1 to m:
     1. Add the Xi node to the network.
     2. Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that we have conditional independence of Xi and all other members of {X1 … Xi-1} given Parents(Xi).
     3. Define the probability table of P(Xi = k | assignments of Parents(Xi)).
• Slide 69: Example Bayes Net Building
  Suppose we're building a nuclear power station. There are the following random variables:
  GRL: Gauge Reads Low
  CTL: Core temperature is low
  FG: Gauge is faulty
  FA: Alarm is faulty
  AS: Alarm sounds
  • If alarm working properly, the alarm is meant to sound if the gauge stops reading a low temp.
  • If gauge working properly, the gauge is meant to read the temp of the core.
• Slide 70: Computing a Joint Entry
  How to compute an entry in a joint distribution? E.g.: what is P(S ^ ~M ^ L ^ ~R ^ T)?
  (Net and tables as on Slide 65.)
• Slide 71: Computing with Bayes Net
  P(T ^ ~R ^ L ^ ~M ^ S)
  = P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
  = P(T | L) * P(~R ^ L ^ ~M ^ S)
  = P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S)
  (Numbers plugged in below.)
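Plugging in the tables from Slide 65: P(T ^ ~R ^ L ^ ~M ^ S) = 0.3 * (1 - 0.6) * 0.1 * (1 - 0.6) * 0.3 = 0.00144.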
• Slide 72: The general case
  P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
  = P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
  = P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
  = P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1)
  = …
  = product for i = 1..n of P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
  = product for i = 1..n of P(Xi=xi | assignments of Parents(Xi))
  So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.
• Slide 73: Where are we now?
  • We have a methodology for building Bayes nets.
  • We don't require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
  • We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
  • So we can also compute answers to any questions.
  E.g. what could we do to compute P(R | T, ~S)?
  (Net and tables as on Slide 65.)
• Slide 74: Where are we now? (continued)
  Step 1: Compute P(R ^ T ^ ~S)
  Step 2: Compute P(~R ^ T ^ ~S)
  Step 3: Return P(R ^ T ^ ~S) / [ P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ]
• Slide 75: Where are we now? (continued)
  Step 1's quantity is the sum of all the rows in the Joint that match R ^ T ^ ~S; Step 2's is the sum of all the rows in the Joint that match ~R ^ T ^ ~S.
• Slide 76: Where are we now? (continued)
  Each of these sums is obtained by the "computing a joint probability entry" method of the earlier slides: 4 joint computes each. (A code sketch follows this slide.)
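The three-step recipe above is straightforward to turn into code. Here is a minimal Python sketch (the function names and dictionary layout are mine; the CPT numbers are the ones from Slide 65) that answers a conditional query such as P(R | T ^ ~S) by brute-force enumeration of the joint, which is exactly the exponential method the next slides warn about.

```python
from itertools import product

# CPTs for the S, M, L, R, T lecture network (numbers from Slide 65).
P_S, P_M = 0.3, 0.6
P_L = {(True, True): 0.05, (True, False): 0.1,    # P(L | M, S), keyed by (M, S)
       (False, True): 0.1, (False, False): 0.2}
P_R = {True: 0.3, False: 0.6}                      # P(R | M)
P_T = {True: 0.3, False: 0.8}                      # P(T | L)

def joint(s, m, l, r, t):
    """P(S=s ^ M=m ^ L=l ^ R=r ^ T=t) via the chain-rule factorization of Slide 71."""
    p = (P_S if s else 1 - P_S) * (P_M if m else 1 - P_M)
    p *= P_L[(m, s)] if l else 1 - P_L[(m, s)]
    p *= P_R[m] if r else 1 - P_R[m]
    p *= P_T[l] if t else 1 - P_T[l]
    return p

def conditional(query, evidence):
    """P(query | evidence), both given as dicts such as {'R': True}."""
    num = den = 0.0
    for s, m, l, r, t in product([True, False], repeat=5):
        world = {'S': s, 'M': m, 'L': l, 'R': r, 'T': t}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, m, l, r, t)
            den += p                      # rows matching the evidence
            if all(world[k] == v for k, v in query.items()):
                num += p                  # rows matching query and evidence
    return num / den

print(conditional({'R': True}, {'T': True, 'S': False}))   # P(R | T ^ ~S)
```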
• Slide 77: The good news
  We can do inference. We can compute any conditional probability:
  P(Some variable | Some other variable values)
  P(E1 | E2) = P(E1 ^ E2) / P(E2)
             = [sum over joint entries matching E1 and E2 of P(joint entry)] / [sum over joint entries matching E2 of P(joint entry)]
• Slide 78: The good news (continued)
  Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables. How much work is the above computation?
• Slide 79: The sad, bad news
  Conditional probabilities by enumerating all matching entries in the joint are expensive: exponential in the number of variables.
• Slide 80: The sad, bad news (continued)
  But perhaps there are faster ways of querying Bayes nets?
  • In fact, if I ever ask you to manually do a Bayes Net inference, you'll find there are often many tricks to save you time.
  • So we've just got to program our computer to do those tricks too, right?
• Slide 81: The sad, bad news (continued)
  Sadder and worse news: general querying of Bayes nets is NP-complete.
• Slide 82: Bayes nets inference algorithms
  A poly-tree is a directed acyclic graph in which no two nodes have more than one path between them.
  [Diagrams: a poly-tree, and a graph that is not a poly-tree but is still a legal Bayes net.]
  • If the net is a poly-tree, there is a linear-time algorithm (see a later Andrew lecture).
  • The best general-case algorithms convert a general net to a poly-tree (often at huge expense) and call the poly-tree algorithm.
  • Another popular, practical approach (doesn't assume a poly-tree): Stochastic Simulation.
• Slide 83: Sampling from the Joint Distribution
  It's pretty easy to generate a set of variable assignments at random with the same probability as the underlying joint distribution. How?
  (Net and tables as on Slide 65.)
• Slide 84: Sampling from the Joint Distribution (continued)
  1. Randomly choose S. S = True with prob 0.3.
  2. Randomly choose M. M = True with prob 0.6.
  3. Randomly choose L. The probability that L is true depends on the assignments of S and M. E.g. if steps 1 and 2 had produced S=True, M=False, then the probability that L is true is 0.1.
  4. Randomly choose R. Probability depends on M.
  5. Randomly choose T. Probability depends on L.
• Slide 85: A general sampling algorithm
  Let's generalize the example on the previous slide to a general Bayes Net. As in Slides 16-17, call the variables X1 .. Xn, where Parents(Xi) must be a subset of {X1 .. Xi-1}.
  For i = 1 to n:
  1. Find the parents, if any, of Xi. Assume n(i) parents. Call them Xp(i,1), Xp(i,2), … Xp(i,n(i)).
  2. Recall the values that those parents were randomly given: xp(i,1), xp(i,2), … xp(i,n(i)).
  3. Look up in the lookup-table: P(Xi=True | Xp(i,1)=xp(i,1), Xp(i,2)=xp(i,2), … Xp(i,n(i))=xp(i,n(i)))
  4. Randomly set xi = True according to this probability.
  x1, x2, … xn are now a sample from the joint distribution of X1, X2, … Xn.
  (A code sketch follows this slide.)
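A minimal Python sketch of this sampling recipe for the lecture network (the CPT numbers come from Slide 65; the representation and names are my own):

```python
import random

# CPTs from Slide 65, in the node ordering S, M, L, R, T (parents always come first).
P_S, P_M = 0.3, 0.6
P_L = {(True, True): 0.05, (True, False): 0.1,    # P(L | M, S), keyed by (M, S)
       (False, True): 0.1, (False, False): 0.2}
P_R = {True: 0.3, False: 0.6}                      # P(R | M)
P_T = {True: 0.3, False: 0.8}                      # P(T | L)

def sample_joint():
    """Draw one assignment of (S, M, L, R, T) with the joint's probability,
    sampling each node after its parents (the recipe on Slide 84)."""
    s = random.random() < P_S
    m = random.random() < P_M
    l = random.random() < P_L[(m, s)]
    r = random.random() < P_R[m]
    t = random.random() < P_T[l]
    return {'S': s, 'M': m, 'L': l, 'R': r, 'T': t}
```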
• Slide 86: Stochastic Simulation Example
  Someone wants to know P(R=True | T=True ^ S=False).
  We'll do lots of random samplings and count the number of occurrences of the following:
  • Nc: num. samples in which T=True and S=False.
  • Ns: num. samples in which R=True, T=True and S=False.
  • N: number of random samplings.
  Now if N is big enough:
  Nc/N is a good estimate of P(T=True and S=False).
  Ns/N is a good estimate of P(R=True, T=True, S=False).
  P(R | T ^ ~S) = P(R ^ T ^ ~S) / P(T ^ ~S), so Ns/Nc can be a good estimate of P(R | T ^ ~S).
• Slide 87: General Stochastic Simulation
  Someone wants to know P(E1 | E2).
  We'll do lots of random samplings and count the number of occurrences of the following:
  • Nc: num. samples in which E2 holds.
  • Ns: num. samples in which E1 and E2 hold.
  • N: number of random samplings.
  Now if N is big enough:
  Nc/N is a good estimate of P(E2).
  Ns/N is a good estimate of P(E1, E2).
  P(E1 | E2) = P(E1 ^ E2) / P(E2), so Ns/Nc can be a good estimate of P(E1 | E2).
  (A code sketch follows this slide.)
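Continuing the sketch above (it reuses sample_joint from the sampling code), a hypothetical estimator implementing the counting recipe from Slides 86-87:

```python
def simulate(query, evidence, n=100_000):
    """Estimate P(query | evidence) by counting random samples.
    query and evidence are dicts such as {'R': True} and {'T': True, 'S': False}."""
    n_c = n_s = 0
    for _ in range(n):
        sample = sample_joint()
        if all(sample[k] == v for k, v in evidence.items()):
            n_c += 1                                   # Nc: sample matches E2
            if all(sample[k] == v for k, v in query.items()):
                n_s += 1                               # Ns: sample matches E1 and E2
    return n_s / n_c                                   # Ns / Nc estimates P(E1 | E2)

# e.g. simulate({'R': True}, {'T': True, 'S': False})
```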
• Slide 88: Likelihood weighting
  Problem with Stochastic Sampling: with lots of constraints in E, or unlikely events in E, most of the simulations will be thrown away (they'll have no effect on Nc or Ns).
  Imagine we're part way through our simulation. In E2 we have the constraint Xi = v. We're just about to generate a value for Xi at random. Given the values assigned to the parents, we see that P(Xi = v | parents) = p.
  Now we know that with stochastic sampling:
  • we'll generate "Xi = v" proportion p of the time, and proceed.
  • And we'll generate a different value proportion 1-p of the time, and the simulation will be wasted.
  Instead, always generate Xi = v, but weight the answer by weight "p" to compensate.
• Slide 89: Likelihood weighting (continued)
  Set Nc := 0, Ns := 0.
  1. Generate a random assignment of all variables that matches E2. This process returns a weight w.
  2. Define w to be the probability that this assignment would have been generated instead of an unmatching assignment during its generation in the original algorithm. Fact: w is a product of all likelihood factors involved in the generation.
  3. Nc := Nc + w
  4. If our sample matches E1 then Ns := Ns + w
  5. Go to 1
  Again, Ns/Nc estimates P(E1 | E2). (A code sketch follows this slide.)
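A sketch of likelihood weighting for the same lecture network, again reusing random and the CPT dictionaries from the sampling sketch above. Evidence variables are clamped to their observed values and their likelihood factors are multiplied into the weight, so no sample is ever thrown away:

```python
def likelihood_weighted_sample(evidence):
    """One weighted sample consistent with evidence (e.g. {'T': True, 'S': False})."""
    w, vals = 1.0, {}
    node_order = [('S', lambda v: P_S),
                  ('M', lambda v: P_M),
                  ('L', lambda v: P_L[(v['M'], v['S'])]),
                  ('R', lambda v: P_R[v['M']]),
                  ('T', lambda v: P_T[v['L']])]
    for name, p_true in node_order:
        p = p_true(vals)                 # P(name = True | sampled parent values)
        if name in evidence:             # clamp evidence and multiply in its likelihood
            vals[name] = evidence[name]
            w *= p if evidence[name] else 1.0 - p
        else:                            # otherwise sample the node as usual
            vals[name] = random.random() < p
    return vals, w

def lw_estimate(query, evidence, n=100_000):
    """Estimate P(query | evidence) with likelihood weighting (Slide 89's loop)."""
    num = den = 0.0
    for _ in range(n):
        sample, w = likelihood_weighted_sample(evidence)
        den += w                                        # Nc := Nc + w
        if all(sample[k] == v for k, v in query.items()):
            num += w                                    # Ns := Ns + w
    return num / den

# e.g. lw_estimate({'R': True}, {'T': True, 'S': False})
```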
• Slide 90: Case Study I
  Pathfinder system (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
  • Diagnostic system for lymph-node diseases.
  • 60 diseases and 100 symptoms and test-results.
  • 14,000 probabilities.
  • Expert consulted to make net:
    • 8 hours to determine variables.
    • 35 hours for net topology.
    • 40 hours for probability table values.
  • Apparently, the experts found it quite easy to invent the causal links and probabilities.
  • Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.
• Slide 91: Questions
  • What are the strengths of probabilistic networks compared with propositional logic?
  • What are the weaknesses of probabilistic networks compared with propositional logic?
  • What are the strengths of probabilistic networks compared with predicate logic?
  • What are the weaknesses of probabilistic networks compared with predicate logic?
  • (How) could predicate logic and probabilistic networks be combined?
• Slide 92: What you should know
  • The meanings and importance of independence and conditional independence.
  • The definition of a Bayes net.
  • Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net.
  • The slow (exponential) method for computing arbitrary, conditional probabilities.
  • The stochastic simulation method and likelihood weighting.